Getting Started
Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.
Here are a few sample use cases for Elasticsearch:
- You run an online web store where you allow your customers to search for products that you sell. In this case, you can use Elasticsearch to store your entire product catalog and inventory and provide search and autocomplete suggestions for them.
- You want to collect log or transaction data and you want to analyze and mine this data to look for trends, statistics, summarizations, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse your data, and then have Logstash feed this data into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine any information that is of interest to you.
- You run a price alerting platform which allows price-savvy customers to specify a rule like "I am interested in buying a specific electronic gadget and I want to be notified if the price of the gadget falls below $X from any vendor within the next month". In this case, you can scrape vendor prices, push them into Elasticsearch, and use its reverse-search (Percolator) capability to match price movements against customer queries, eventually pushing alerts out to the customer once matches are found.
- You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions on a lot of data (think millions or billions of records). In this case, you can use Elasticsearch to store your data and then use Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards that can visualize aspects of your data that are important to you. Additionally, you can use the Elasticsearch aggregations functionality to perform complex business intelligence queries against your data.
For the rest of this tutorial, I will guide you through the process of getting Elasticsearch up and running, taking a peek inside it, and performing basic operations like indexing, searching, and modifying your data. At the end of this tutorial, you should have a good idea of what Elasticsearch is, how it works, and hopefully be inspired to see how you can use it to either build sophisticated search applications or to mine intelligence from your data.
1. Basic Concepts
There are a few concepts that are core to Elasticsearch. Understanding these concepts from the outset will tremendously help ease the learning process.
Near Realtime (NRT)
Elasticsearch is a near real time search platform. What this means is there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.
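If you need a freshly indexed document to be searchable right away (for example, in a test script), you can trigger a refresh manually rather than waiting for the interval. A minimal sketch, assuming an index named customer already exists; note that frequent manual refreshes are expensive and should be avoided in production:

```shell
# Force a refresh so documents indexed up to this point become searchable
# immediately instead of after the next scheduled refresh.
curl -XPOST 'localhost:9200/customer/_refresh?pretty'
```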
Cluster
A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes. A cluster is identified by a unique name which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.
Make sure that you don’t reuse the same cluster names in different environments, otherwise you might end up with nodes joining the wrong cluster. For instance, you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.
Note that it is valid and perfectly fine to have a cluster with only a single node in it. Furthermore, you may also have multiple independent clusters each with its own unique cluster name.
Node
A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name which by default is a random Marvel character name that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.
A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch which means that if you start up a number of nodes on your network and—assuming they can discover each other—they will all automatically form and join a single cluster named elasticsearch.
In a single cluster, you can have as many nodes as you want. Furthermore, if there are no other Elasticsearch nodes currently running on your network, starting a single node will by default form a new single-node cluster named elasticsearch.
Index
An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.
In a single cluster, you can define as many indexes as you want.
Type
Within an index, you can define one or more types. A type is a logical category/partition of your index whose semantics is completely up to you. In general, a type is defined for documents that have a set of common fields. For example, let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.
Document
A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation), a ubiquitous internet data interchange format.
Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.
Shards & Replicas
An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.
To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.
Sharding is important for two primary reasons:
- It allows you to horizontally split/scale your content volume
- It allows you to distribute and parallelize operations across shards (potentially on multiple nodes), thus increasing performance/throughput
The mechanics of how a shard is distributed and how its documents are aggregated back into search requests are completely managed by Elasticsearch and are transparent to you as the user.
In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.
Replication is important for two primary reasons:
- It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
- It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.
To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards). The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically anytime, but you cannot change the number of shards after the fact.
By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica) for a total of 10 shards per index.
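Both defaults can be overridden when an index is created. A minimal sketch, where the index name blogs and the numbers are arbitrary choices for illustration:

```shell
# Create an index with 3 primary shards and 2 replicas instead of the
# default 5 and 1. Replicas can be changed later; primary shards cannot.
curl -XPUT 'localhost:9200/blogs?pretty' -d '
{
  "settings" : {
    "number_of_shards" : 3,
    "number_of_replicas" : 2
  }
}'
```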
Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards API.
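For example, the following lists every shard along with its document count, size on disk, and the node it is allocated to:

```shell
# The ?v parameter adds a header row to the tabular output
curl 'localhost:9200/_cat/shards?v'
```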
With that out of the way, let’s get started with the fun part…
2. Installation
Elasticsearch requires at least Java 7. Specifically, as of this writing, it is recommended that you use the Oracle JDK version 1.8.0_73. Java installation varies from platform to platform so we won’t go into those details here; Oracle’s recommended installation documentation can be found on Oracle’s website. Suffice it to say, before you install Elasticsearch, please check your Java version first by running the following (and then install/upgrade accordingly if needed):
java -version
echo $JAVA_HOME
Once we have Java set up, we can then download and run Elasticsearch. The binaries are available from www.elastic.co/downloads along with all the releases that have been made in the past. For each release, you have a choice among a zip or tar archive, or a DEB or RPM package. For simplicity, let’s use the tar file.
Let’s download the Elasticsearch 2.3.0 tar as follows (Windows users should download the zip package):
curl -L -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.3.0/elasticsearch-2.3.0.tar.gz
Then extract it as follows (Windows users should unzip the zip package):
tar -xvf elasticsearch-2.3.0.tar.gz
It will then create a bunch of files and folders in your current directory. We then go into the bin directory as follows:
cd elasticsearch-2.3.0/bin
And now we are ready to start our node and single cluster (Windows users should run the elasticsearch.bat file):
./elasticsearch
If everything goes well, you should see a bunch of messages that look like below:
[2014-03-13 13:42:17,218][INFO ][node ] [New Goblin] version[2.3.0], pid[2085], build[5c03844/2014-02-25T15:52:53Z]
[2014-03-13 13:42:17,219][INFO ][node ] [New Goblin] initializing ...
[2014-03-13 13:42:17,223][INFO ][plugins ] [New Goblin] loaded [], sites []
[2014-03-13 13:42:19,831][INFO ][node ] [New Goblin] initialized
[2014-03-13 13:42:19,832][INFO ][node ] [New Goblin] starting ...
[2014-03-13 13:42:19,958][INFO ][transport ] [New Goblin] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.8.112:9300]}
[2014-03-13 13:42:23,030][INFO ][cluster.service] [New Goblin] new_master [New Goblin][rWMtGj3dQouz2r6ZFL9v4g][mwubuntu1][inet[/192.168.8.112:9300]], reason: zen-disco-join (elected_as_master)
[2014-03-13 13:42:23,100][INFO ][discovery ] [New Goblin] elasticsearch/rWMtGj3dQouz2r6ZFL9v4g
[2014-03-13 13:42:23,125][INFO ][http ] [New Goblin] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.8.112:9200]}
[2014-03-13 13:42:23,629][INFO ][gateway ] [New Goblin] recovered [1] indices into cluster_state
[2014-03-13 13:42:23,630][INFO ][node ] [New Goblin] started
Without going into too much detail, we can see that our node named "New Goblin" (which will be a different Marvel character in your case) has started and elected itself as master in a single cluster. Don’t worry for the moment about what master means. The main thing to note here is that we have started one node within one cluster.
As mentioned previously, we can override either the cluster or node name. This can be done from the command line when starting Elasticsearch as follows:
./elasticsearch --cluster.name my_cluster_name --node.name my_node_name
Also note the line marked http with information about the HTTP address (192.168.8.112) and port (9200) that our node is reachable from. By default, Elasticsearch uses port 9200 to provide access to its REST API. This port is configurable if necessary.
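As a quick sanity check, you can hit the root endpoint on that port, which returns basic information about the node and the Elasticsearch version it is running:

```shell
# Should respond with a small JSON document containing the node name,
# cluster name, and version information.
curl 'localhost:9200/?pretty'
```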
3. Exploring Your Cluster
The REST API
Now that we have our node (and cluster) up and running, the next step is to understand how to communicate with it. Fortunately, Elasticsearch provides a very comprehensive and powerful REST API that you can use to interact with your cluster. Among the things that can be done with the API are the following:
- Check your cluster, node, and index health, status, and statistics
- Administer your cluster, node, and index data and metadata
- Perform CRUD (Create, Read, Update, and Delete) and search operations against your indexes
- Execute advanced search operations such as paging, sorting, filtering, scripting, aggregations, and many others
3.1. Cluster Health
Let’s start with a basic health check, which we can use to see how our cluster is doing. We’ll be using curl to do this but you can use any tool that allows you to make HTTP/REST calls. Let’s assume that we are still on the same node where we started Elasticsearch and have opened another command shell window.
To check the cluster health, we will be using the _cat API. Remember previously that our node HTTP endpoint is available at port 9200:
curl 'localhost:9200/_cat/health?v'
And the response:
epoch timestamp cluster status node.total node.data shards pri relo init unassign
1394735289 14:28:09 elasticsearch green 1 1 0 0 0 0 0
We can see that our cluster named "elasticsearch" is up with a green status.
Whenever we ask for the cluster health, we either get green, yellow, or red. Green means everything is good (cluster is fully functional), yellow means all data is available but some replicas are not yet allocated (cluster is fully functional), and red means some data is not available for whatever reason. Note that even if a cluster is red, it still is partially functional (i.e. it will continue to serve search requests from the available shards) but you will likely need to fix it ASAP since you have missing data.
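If you want the same information as JSON rather than the tabular _cat output, the cluster health API returns the status along with shard-level counts that help explain a yellow or red state:

```shell
# Returns status (green/yellow/red) plus counts such as
# active_shards, relocating_shards, and unassigned_shards.
curl 'localhost:9200/_cluster/health?pretty'
```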
Also from the above response, we can see a total of 1 node and that we have 0 shards since we have no data in it yet. Note that since we are using the default cluster name (elasticsearch) and since Elasticsearch uses unicast network discovery by default to find other nodes on the same machine, it is possible that you could accidentally start up more than one node on your computer and have them all join a single cluster. In this scenario, you may see more than 1 node in the above response.
We can also get a list of nodes in our cluster as follows:
curl 'localhost:9200/_cat/nodes?v'
And the response:
host ip heap.percent ram.percent load node.role master name
mwubuntu1 127.0.1.1 8 4 0.00 d * New Goblin
Here, we can see our one node named "New Goblin", which is the single node that is currently in our cluster.
3.2. List All Indices
Now let’s take a peek at our indices:
curl 'localhost:9200/_cat/indices?v'
And the response:
health index pri rep docs.count docs.deleted store.size pri.store.size
Which simply means we have no indices yet in the cluster.
3.3. Create an Index
Now let’s create an index named "customer" and then list all the indexes again:
curl -XPUT 'localhost:9200/customer?pretty'
curl 'localhost:9200/_cat/indices?v'
The first command creates the index named "customer" using the PUT verb. We simply append pretty to the end of the call to tell it to pretty-print the JSON response (if any).
And the response:
curl -XPUT 'localhost:9200/customer?pretty'
{
"acknowledged" : true
}
curl 'localhost:9200/_cat/indices?v'
health index pri rep docs.count docs.deleted store.size pri.store.size
yellow customer 5 1 0 0 495b 495b
The results of the second command tell us that we now have 1 index named customer, that it has 5 primary shards and 1 replica (the defaults), and that it contains 0 documents.
You might also notice that the customer index has a yellow health tagged to it. Recall from our previous discussion that yellow means that some replicas are not (yet) allocated. The reason this happens for this index is because Elasticsearch by default created one replica for this index. Since we only have one node running at the moment, that one replica cannot yet be allocated (for high availability) until a later point in time when another node joins the cluster. Once that replica gets allocated onto a second node, the health status for this index will turn to green.
3.4. Index and Query a Document
Let’s now put something into our customer index. Remember previously that in order to index a document, we must tell Elasticsearch which type in the index it should go to.
Let’s index a simple customer document into the customer index, "external" type, with an ID of 1 as follows:
Our JSON document: { "name": "John Doe" }
curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '
{
"name": "John Doe"
}'
And the response:
{
"_index" : "customer",
"_type" : "external",
"_id" : "1",
"_version" : 1,
"created" : true
}
From the above, we can see that a new customer document was successfully created inside the customer index and the external type. The document also has an internal id of 1 which we specified at index time.
It is important to note that Elasticsearch does not require you to explicitly create an index first before you can index documents into it. In the previous example, Elasticsearch will automatically create the customer index if it didn’t already exist beforehand.
Let’s now retrieve that document that we just indexed:
curl -XGET 'localhost:9200/customer/external/1?pretty'
And the response:
{
"_index" : "customer",
"_type" : "external",
"_id" : "1",
"_version" : 1,
"found" : true, "_source" : { "name": "John Doe" }
}
Nothing out of the ordinary here other than a field, found, stating that we found a document with the requested ID 1 and another field, _source, which returns the full JSON document that we indexed from the previous step.
3.5. Delete an Index
Now let’s delete the index that we just created and then list all the indexes again:
curl -XDELETE 'localhost:9200/customer?pretty'
curl 'localhost:9200/_cat/indices?v'
And the response:
curl -XDELETE 'localhost:9200/customer?pretty'
{
"acknowledged" : true
}
curl 'localhost:9200/_cat/indices?v'
health index pri rep docs.count docs.deleted store.size pri.store.size
Which means that the index was deleted successfully and we are now back to where we started with nothing in our cluster.
Before we move on, let’s take a closer look again at some of the API commands that we have learned so far:
curl -XPUT 'localhost:9200/customer'
curl -XPUT 'localhost:9200/customer/external/1' -d '
{
"name": "John Doe"
}'
curl 'localhost:9200/customer/external/1'
curl -XDELETE 'localhost:9200/customer'
If we study the above commands carefully, we can actually see a pattern of how we access data in Elasticsearch. That pattern can be summarized as follows:
curl -X<REST Verb> <Node>:<Port>/<Index>/<Type>/<ID>
This REST access pattern is so pervasive throughout all the API commands that if you can simply remember it, you will have a good head start at mastering Elasticsearch.
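As an exercise, here is the pattern applied to a few more operations on the same document (the index customer, type external, and ID 1 from the examples above). The -I flag sends a HEAD request, which is an existence check that returns only an HTTP status:

```shell
curl -I 'localhost:9200/customer/external/1'        # HEAD: 200 if the document exists, 404 if not
curl 'localhost:9200/customer/external/1'           # GET (the default verb): retrieve the document
curl -XDELETE 'localhost:9200/customer/external/1'  # delete just this document, leaving the index in place
```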
4. Modifying Your Data
Elasticsearch provides data manipulation and search capabilities in near real time. By default, you can expect a one second delay (refresh interval) from the time you index/update/delete your data until the time that it appears in your search results. This is an important distinction from other platforms like SQL wherein data is immediately available after a transaction is completed.
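The refresh interval itself is a per-index setting. As a sketch (reusing the customer index from earlier), you could relax it to trade search freshness for indexing throughput:

```shell
# Raise the refresh interval from the default 1s to 30s. Newly indexed
# documents may then take up to 30 seconds to appear in search results.
curl -XPUT 'localhost:9200/customer/_settings?pretty' -d '
{
  "index" : { "refresh_interval" : "30s" }
}'
```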
Indexing/Replacing Documents
We’ve previously seen how we can index a single document. Let’s recall that command again:
curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '
{
"name": "John Doe"
}'
Again, the above will index the specified document into the customer index, external type, with the ID of 1. If we then execute the above command again with a different (or the same) document, Elasticsearch will replace (i.e. reindex) the new document on top of the existing one with the ID of 1:
curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '
{
"name": "Jane Doe"
}'
The above changes the name of the document with the ID of 1 from "John Doe" to "Jane Doe". If, on the other hand, we use a different ID, a new document will be indexed and the existing document(s) already in the index remain untouched.
curl -XPUT 'localhost:9200/customer/external/2?pretty' -d '
{
"name": "Jane Doe"
}'
The above indexes a new document with an ID of 2.
When indexing, the ID part is optional. If not specified, Elasticsearch will generate a random ID and then use it to index the document. The actual ID Elasticsearch generates (or whatever we specified explicitly in the previous examples) is returned as part of the index API call.
This example shows how to index a document without an explicit ID:
curl -XPOST 'localhost:9200/customer/external?pretty' -d '
{
"name": "Jane Doe"
}'
Note that in the above case, we are using the POST verb instead of PUT since we didn’t specify an ID.
4.1. Updating Documents
In addition to being able to index and replace documents, we can also update documents. Note though that Elasticsearch does not actually do in-place updates under the hood. Whenever we do an update, Elasticsearch deletes the old document and then indexes a new document with the update applied to it in one shot.
This example shows how to update our previous document (ID of 1) by changing the name field to "Jane Doe":
curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
"doc": { "name": "Jane Doe" }
}'
This example shows how to update our previous document (ID of 1) by changing the name field to "Jane Doe" and at the same time add an age field to it:
curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
"doc": { "name": "Jane Doe", "age": 20 }
}'
Updates can also be performed by using simple scripts. Note that dynamic scripts like the following are disabled by default as of 1.4.3; have a look at the scripting docs for more details. This example uses a script to increment the age by 5:
curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
"script" : "ctx._source.age += 5"
}'
In the above example, ctx._source refers to the current source document that is about to be updated.
Note that as of this writing, updates can only be performed on a single document at a time. In the future, Elasticsearch might provide the ability to update multiple documents given a query condition (like an SQL UPDATE-WHERE statement).
4.2. Deleting Documents
Deleting a document is fairly straightforward. This example shows how to delete our previous customer with the ID of 2:
curl -XDELETE 'localhost:9200/customer/external/2?pretty'
The delete-by-query plugin can delete all documents matching a specific query.
4.3. Batch Processing
In addition to being able to index, update, and delete individual documents, Elasticsearch also provides the ability to perform any of the above operations in batches using the _bulk API. This functionality is important in that it provides a very efficient mechanism to do multiple operations with as few network roundtrips as possible.
As a quick example, the following call indexes two documents (ID 1 - John Doe and ID 2 - Jane Doe) in one bulk operation:
curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'
This example updates the first document (ID of 1) and then deletes the second document (ID of 2) in one bulk operation:
curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' -d '
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}
'
Note above that for the delete action, there is no corresponding source document after it since deletes only require the ID of the document to be deleted.
The bulk API executes all the actions sequentially and in order. If a single action fails for whatever reason, it will continue to process the remainder of the actions after it. When the bulk API returns, it will provide a status for each action (in the same order it was sent in) so that you can check if a specific action failed or not.
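As an abridged sketch (the exact values on your machine will differ), the bulk response for the update-then-delete example above looks roughly like the following, with one item per action in the order the actions were sent:

```json
{
  "took" : 3,
  "errors" : false,
  "items" : [ {
    "update" : { "_index" : "customer", "_type" : "external", "_id" : "1", "_version" : 2, "status" : 200 }
  }, {
    "delete" : { "_index" : "customer", "_type" : "external", "_id" : "2", "_version" : 2, "status" : 200, "found" : true }
  } ]
}
```

The top-level errors flag is a quick way to tell whether any individual action failed before inspecting the items array one by one.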
5. Exploring Your Data
Sample Dataset
Now that we’ve gotten a glimpse of the basics, let’s try to work on a more realistic dataset. I’ve prepared a sample of fictitious JSON documents of customer bank account information. Each document has the following schema:
{
"account_number": 0,
"balance": 16623,
"firstname": "Bradshaw",
"lastname": "Mckenzie",
"age": 29,
"gender": "F",
"address": "244 Columbus Place",
"employer": "Euron",
"email": "bradshawmckenzie@euron.com",
"city": "Hobucken",
"state": "CO"
}
For the curious, I generated this data from www.json-generator.com/ so please ignore the actual values and semantics of the data as these are all randomly generated.
Loading the Sample Dataset
You can download the sample dataset (accounts.json) from here. Extract it to our current directory and let’s load it into our cluster as follows:
curl -XPOST 'localhost:9200/bank/account/_bulk?pretty' --data-binary "@accounts.json"
curl 'localhost:9200/_cat/indices?v'
And the response:
health index pri rep docs.count docs.deleted store.size pri.store.size
yellow bank 5 1 1000 0 424.4kb 424.4kb
Which means that we just successfully bulk indexed 1000 documents into the bank index (under the account type).
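To double-check, the _count API returns just the number of documents matching a query (all of them, by default):

```shell
# Should report a count of 1000 for the freshly loaded dataset
curl 'localhost:9200/bank/account/_count?pretty'
```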
5.1. The Search API
Now let’s start with some simple searches. There are two basic ways to run searches: one is by sending search parameters through the REST request URI and the other by sending them through the REST request body. The request body method allows you to be more expressive and also to define your searches in a more readable JSON format. We’ll try one example of the request URI method but for the remainder of this tutorial, we will exclusively be using the request body method.
The REST API for search is accessible from the _search endpoint. This example returns all documents in the bank index:
curl 'localhost:9200/bank/_search?q=*&pretty'
Let’s first dissect the search call. We are searching (_search endpoint) in the bank index, and the q=* parameter instructs Elasticsearch to match all documents in the index. The pretty parameter, again, just tells Elasticsearch to return pretty-printed JSON results.
And the response (partially shown):
{
"took" : 63,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1000,
"max_score" : 1.0,
"hits" : [ {
"_index" : "bank",
"_type" : "account",
"_id" : "1",
"_score" : 1.0, "_source" : {"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}
}, {
"_index" : "bank",
"_type" : "account",
"_id" : "6",
"_score" : 1.0, "_source" : {"account_number":6,"balance":5686,"firstname":"Hattie","lastname":"Bond","age":36,"gender":"M","address":"671 Bristol Street","employer":"Netagy","email":"hattiebond@netagy.com","city":"Dante","state":"TN"}
}, {
"_index" : "bank",
"_type" : "account",
As for the response, we see the following parts:
- took – time in milliseconds for Elasticsearch to execute the search
- timed_out – tells us if the search timed out or not
- _shards – tells us how many shards were searched, as well as a count of the successful/failed searched shards
- hits – search results
- hits.total – total number of documents matching our search criteria
- hits.hits – actual array of search results (defaults to first 10 documents)
- _score and max_score – ignore these fields for now
Here is the same exact search above using the alternative request body method:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": { "match_all": {} }
}'
The difference here is that instead of passing q=* in the URI, we POST a JSON-style query request body to the _search API. We’ll discuss this JSON query in the next section.
And the response (partially shown):
{
"took" : 26,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1000,
"max_score" : 1.0,
"hits" : [ {
"_index" : "bank",
"_type" : "account",
"_id" : "1",
"_score" : 1.0, "_source" : {"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}
}, {
"_index" : "bank",
"_type" : "account",
"_id" : "6",
"_score" : 1.0, "_source" : {"account_number":6,"balance":5686,"firstname":"Hattie","lastname":"Bond","age":36,"gender":"M","address":"671 Bristol Street","employer":"Netagy","email":"hattiebond@netagy.com","city":"Dante","state":"TN"}
}, {
"_index" : "bank",
"_type" : "account",
"_id" : "13",
It is important to understand that once you get your search results back, Elasticsearch is completely done with the request and does not maintain any kind of server-side resources or open cursors into your results. This is in stark contrast to many other platforms such as SQL wherein you may initially get a partial subset of your query results up-front and then you have to continuously go back to the server if you want to fetch (or page through) the rest of the results using some kind of stateful server-side cursor.
5.2. Introducing the Query Language
Elasticsearch provides a JSON-style domain-specific language that you can use to execute queries. This is referred to as the Query DSL. The query language is quite comprehensive and can be intimidating at first glance but the best way to actually learn it is to start with a few basic examples.
Going back to our last example, we executed this query:
{
"query": { "match_all": {} }
}
Dissecting the above, the query part tells us what our query definition is and the match_all part is simply the type of query that we want to run. The match_all query is simply a search for all documents in the specified index.
In addition to the query parameter, we also can pass other parameters to influence the search results. For example, the following does a match_all and returns only the first document:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": { "match_all": {} },
"size": 1
}'
Note that if size is not specified, it defaults to 10.
This example does a match_all and returns documents 11 through 20:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": { "match_all": {} },
"from": 10,
"size": 10
}'
The from parameter (0-based) specifies which document index to start from and the size parameter specifies how many documents to return starting at the from parameter. This feature is useful when implementing paging of search results. Note that if from is not specified, it defaults to 0.
This example does a match_all and sorts the results by account balance in descending order and returns the top 10 (default size) documents.
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": { "match_all": {} },
"sort": { "balance": { "order": "desc" } }
}'
5.3. Executing Searches
Now that we have seen a few of the basic search parameters, let’s dig a little deeper into the Query DSL. First, let’s take a look at the returned document fields. By default, the full JSON document is returned as part of all searches. This is referred to as the source (the _source field in the search hits). If we don’t want the entire source document returned, we can request that only a few fields from within the source be returned.
This example shows how to return two fields, account_number and balance (inside of _source), from the search:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": { "match_all": {} },
"_source": ["account_number", "balance"]
}'
Note that the above example simply reduces the _source field. It will still only return one field named _source but within it, only the fields account_number and balance are included.
If you come from a SQL background, the above is somewhat similar in concept to the field list of a SQL SELECT statement.
Now let’s move on to the query part. Previously, we’ve seen how the match_all query is used to match all documents. Let’s now introduce a new query called the match query, which can be thought of as a basic fielded search query (i.e. a search done against a specific field or set of fields).
This example returns the account numbered 20:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": { "match": { "account_number": 20 } }
}'
This example returns all accounts containing the term "mill" in the address:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": { "match": { "address": "mill" } }
}'
This example returns all accounts containing the term "mill" or "lane" in the address:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": { "match": { "address": "mill lane" } }
}'
This example is a variant of match (match_phrase) that returns all accounts containing the phrase "mill lane" in the address:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": { "match_phrase": { "address": "mill lane" } }
}'
Let’s now introduce the bool(ean) query. The bool query allows us to compose smaller queries into bigger queries using boolean logic.
This example composes two match queries and returns all accounts containing "mill" and "lane" in the address:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": {
"bool": {
"must": [
{ "match": { "address": "mill" } },
{ "match": { "address": "lane" } }
]
}
}
}'
In the above example, the bool must clause specifies all the queries that must be true for a document to be considered a match.
In contrast, this example composes two match queries and returns all accounts containing "mill" or "lane" in the address:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": {
"bool": {
"should": [
{ "match": { "address": "mill" } },
{ "match": { "address": "lane" } }
]
}
}
}'
In the above example, the bool should clause specifies a list of queries either of which must be true for a document to be considered a match.
This example composes two match queries and returns all accounts that contain neither "mill" nor "lane" in the address:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": {
"bool": {
"must_not": [
{ "match": { "address": "mill" } },
{ "match": { "address": "lane" } }
]
}
}
}'
In the above example, the bool must_not clause specifies a list of queries none of which must be true for a document to be considered a match.
We can combine must, should, and must_not clauses simultaneously inside a bool query. Furthermore, we can compose bool queries inside any of these bool clauses to mimic any complex multi-level boolean logic.
This example returns all accounts belonging to anybody who is 40 years old but doesn’t live in ID(aho):
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": {
"bool": {
"must": [
{ "match": { "age": "40" } }
],
"must_not": [
{ "match": { "state": "ID" } }
]
}
}
}'
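Because a bool query is itself just a query, it can be nested inside another bool clause. As a sketch against the same bank index (the state values TX and CA here are purely illustrative), the following finds accounts containing "mill" in the address that are located in either TX or CA:

```shell
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        {
          "bool": {
            "should": [
              { "match": { "state": "TX" } },
              { "match": { "state": "CA" } }
            ]
          }
        }
      ]
    }
  }
}'
```

From the outer must clause's point of view, the inner bool behaves like any other single query, so this nesting can be repeated to arbitrary depth.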
5.4. Executing Filters
In the previous section, we skipped over a little detail called the document score (the _score field in the search results). The score is a numeric value that is a relative measure of how well the document matches the search query that we specified. The higher the score, the more relevant the document; the lower the score, the less relevant the document.
But queries do not always need to produce scores, in particular when they are only used for "filtering" the document set. Elasticsearch detects these situations and automatically optimizes query execution in order not to compute useless scores.
The bool query that we introduced in the previous section also supports filter clauses, which allow us to use a query to restrict the documents that will be matched by other clauses, without changing how scores are computed. As an example, let’s introduce the range query, which allows us to filter documents by a range of values. This is generally used for numeric or date filtering.
This example uses a bool query to return all accounts with balances between 20000 and 30000, inclusive. In other words, we want to find accounts with a balance that is greater than or equal to 20000 and less than or equal to 30000.
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"query": {
"bool": {
"must": { "match_all": {} },
"filter": {
"range": {
"balance": {
"gte": 20000,
"lte": 30000
}
}
}
}
}
}'
Dissecting the above, the bool query contains a match_all query (the query part) and a range query (the filter part). We can substitute any other queries into the query and the filter parts. In the above case, the range query makes perfect sense since documents falling into the range all match "equally", i.e., no document is more relevant than another.
In addition to the match_all, match, bool, and range queries, there are a lot of other query types that are available and we won’t go into them here. Since we already have a basic understanding of how they work, it shouldn’t be too difficult to apply this knowledge in learning and experimenting with the other query types.
5.5. Executing Aggregations
Aggregations provide the ability to group and extract statistics from your data. The easiest way to think about aggregations is by roughly equating them to the SQL GROUP BY clause and the SQL aggregate functions. In Elasticsearch, you have the ability to execute searches returning hits and at the same time return aggregated results separate from the hits, all in one response. This is very powerful and efficient in the sense that you can run queries and multiple aggregations and get back the results of both (or either) operation in one shot, avoiding network round trips, using a concise and simplified API.
To start with, this example groups all the accounts by state, and then returns the top 10 (default) states sorted by count descending (also default):
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state"
}
}
}
}'
In SQL, the above aggregation is similar in concept to:
SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC
And the response (partially shown):
"hits" : {
"total" : 1000,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"group_by_state" : {
"buckets" : [ {
"key" : "al",
"doc_count" : 21
}, {
"key" : "tx",
"doc_count" : 17
}, {
"key" : "id",
"doc_count" : 15
}, {
"key" : "ma",
"doc_count" : 15
}, {
"key" : "md",
"doc_count" : 15
}, {
"key" : "pa",
"doc_count" : 15
}, {
"key" : "dc",
"doc_count" : 14
}, {
"key" : "me",
"doc_count" : 14
}, {
"key" : "mo",
"doc_count" : 14
}, {
"key" : "nd",
"doc_count" : 14
} ]
}
}
}
We can see that there are 21 accounts in AL(abama), followed by 17 accounts in TX, followed by 15 accounts in ID(aho), and so forth.
Note that we set size=0 to not show search hits because we only want to see the aggregation results in the response.
Building on the previous aggregation, this example calculates the average account balance by state (again only for the top 10 states sorted by count in descending order):
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state"
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}'
Notice how we nested the average_balance aggregation inside the group_by_state aggregation. This is a common pattern for all the aggregations. You can nest aggregations inside aggregations arbitrarily to extract pivoted summarizations that you require from your data.
Building on the previous aggregation, let’s now sort on the average balance in descending order:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state",
"order": {
"average_balance": "desc"
}
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}'
This example demonstrates how we can group by age brackets (ages 20-29, 30-39, and 40-49), then by gender, and then finally get the average account balance, per age bracket, per gender:
curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
"size": 0,
"aggs": {
"group_by_age": {
"range": {
"field": "age",
"ranges": [
{
"from": 20,
"to": 30
},
{
"from": 30,
"to": 40
},
{
"from": 40,
"to": 50
}
]
},
"aggs": {
"group_by_gender": {
"terms": {
"field": "gender"
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
}
}'
There are many other aggregation capabilities that we won’t go into in detail here. The aggregations reference guide is a great starting point if you want to do further experimentation.
6. Conclusion
Elasticsearch is both a simple and complex product. We’ve so far learned the basics of what it is, how to look inside of it, and how to work with it using some of the REST APIs. I hope that this tutorial has given you a better understanding of what Elasticsearch is and more importantly, inspired you to further experiment with the rest of its great features!
Setup
This section includes information on how to set up Elasticsearch and get it running. If you haven’t already, download it, and then check the installation docs.
Note: Elasticsearch can also be installed from our repositories using apt or yum. See Repositories.
Supported platforms
The matrix of officially supported operating systems and JVMs is available here: Support Matrix. Elasticsearch is tested on the listed platforms, but it is possible that it will work on other platforms too.
Installation
After downloading the latest release and extracting it, Elasticsearch can be started using:
$ bin/elasticsearch
On *nix systems, the command will start the process in the foreground.
Running as a daemon
To run it in the background, add the -d switch to it:
$ bin/elasticsearch -d
PID
The Elasticsearch process can write its PID to a specified file on startup, making it easy to shut down the process later on:
$ bin/elasticsearch -d -p pid
$ kill `cat pid` 
The PID is written to a file called pid.
The kill command sends a TERM signal to the PID stored in the pid file.
Note: The startup scripts provided for Linux and Windows take care of starting and stopping the Elasticsearch process for you.
Java (JVM) version
Elasticsearch is built using Java, and requires at least Java 7 in order to run. Only Oracle’s Java and the OpenJDK are supported. The same JVM version should be used on all Elasticsearch nodes and clients.
We recommend installing Java 8 update 20 or later, or Java 7 update 55 or later. Previous versions of Java 7 are known to have bugs that can cause index corruption and data loss. Elasticsearch will refuse to start if a known-bad version of Java is used.
The version of Java to use can be configured by setting the JAVA_HOME
environment variable.
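For example, on a Linux system (the JDK path below is purely illustrative; substitute the location of your own installation):

```shell
# Use a specific JDK for Elasticsearch (example path, adjust to your system)
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
./bin/elasticsearch
```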
7. Configuration
Environment Variables
The scripts that start Elasticsearch come with a built-in set of JAVA_OPTS that are passed to the JVM. The most important settings are -Xmx, which controls the maximum memory allowed for the process, and -Xms, which controls the minimum memory allocated to it (in general, the more memory allocated to the process, the better).
Most of the time it is better to leave the default JAVA_OPTS as they are, and to use the ES_JAVA_OPTS environment variable to set or change JVM settings or arguments.
The ES_HEAP_SIZE environment variable sets the heap memory that will be allocated to the Elasticsearch Java process. It assigns the same value to both the minimum and maximum heap, though these can be set explicitly (not recommended) via ES_MIN_MEM (defaults to 256m) and ES_MAX_MEM (defaults to 1g).
It is recommended to set the minimum and maximum memory to the same value, and to enable mlockall.
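For example, to start Elasticsearch with a 4 GB heap (the value is illustrative; a common rule of thumb is to give the heap no more than half of the machine's RAM):

```shell
# ES_HEAP_SIZE sets both the minimum (-Xms) and maximum (-Xmx) heap to the same value
export ES_HEAP_SIZE=4g
./bin/elasticsearch -d -p pid
```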
System Configuration
File Descriptors
Make sure to increase the number of open file descriptors on the machine (or for the user running Elasticsearch). Setting it to 32k or even 64k is recommended.
In order to test how many open files the process can open, start it with
-Des.max-open-files set to true. This will print the number of open
files the process can open on startup.
Alternatively, you can retrieve the max_file_descriptors for each node
using the Nodes Info API, with:
curl localhost:9200/_nodes/stats/process?pretty
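As a sketch of how the limit is typically raised on Linux (the user name elasticsearch and the value 65536 below are illustrative; use whichever user runs the process):

```shell
# Raise the open-file limit for the current shell before starting Elasticsearch
ulimit -n 65536

# Or persist it by adding lines like these to /etc/security/limits.conf:
#   elasticsearch soft nofile 65536
#   elasticsearch hard nofile 65536
```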
Virtual memory
Elasticsearch uses a hybrid mmapfs / niofs directory by default to store its indices. The default operating system limits on mmap counts are likely to be too low, which may result in out-of-memory exceptions. On Linux, you can increase the limits by running the following command as root:
sysctl -w vm.max_map_count=262144
To set this value permanently, update the vm.max_map_count setting in
/etc/sysctl.conf.
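For example, as root (appending to /etc/sysctl.conf assumes the setting is not already present in that file):

```shell
# Apply the new limit immediately
sysctl -w vm.max_map_count=262144
# Persist it across reboots
echo "vm.max_map_count=262144" >> /etc/sysctl.conf
```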
Note: If you installed Elasticsearch using a package (.deb, .rpm), this setting is changed automatically. To verify, run sysctl vm.max_map_count.
Memory Settings
Most operating systems try to use as much memory as possible for file system caches and eagerly swap out unused application memory, possibly resulting in the elasticsearch process being swapped. Swapping is very bad for performance and for node stability, so it should be avoided at all costs.
There are three options:
-
Disable swap
The simplest option is to completely disable swap. Usually Elasticsearch is the only service running on a box, and its memory usage is controlled by the ES_HEAP_SIZE environment variable. There should be no need to have swap enabled.
On Linux systems, you can disable swap temporarily by running sudo swapoff -a. To disable it permanently, you will need to edit the /etc/fstab file and comment out any lines that contain the word swap.
On Windows, the equivalent can be achieved by disabling the paging file entirely via System Properties → Advanced → Performance → Advanced → Virtual memory.
-
Configure swappiness
The second option is to ensure that the sysctl value vm.swappiness is set to 0. This reduces the kernel’s tendency to swap and should not lead to swapping under normal circumstances, while still allowing the whole system to swap in emergency conditions.
From kernel version 3.5-rc1 and above, a swappiness of 0 will cause the OOM killer to kill the process instead of allowing swapping. You will need to set swappiness to 1 to still allow swapping in emergencies.
-
mlockall
The third option is to use mlockall on Linux/Unix systems, or VirtualLock on Windows, to try to lock the process address space into RAM, preventing any Elasticsearch memory from being swapped out. This can be done by adding this line to the config/elasticsearch.yml file:
bootstrap.mlockall: true
After starting Elasticsearch, you can see whether this setting was applied successfully by checking the value of mlockall in the output from this request:
curl http://localhost:9200/_nodes/process?pretty
If you see that mlockall is false, then it means that the mlockall request has failed. The most probable reason, on Linux/Unix systems, is that the user running Elasticsearch doesn’t have permission to lock memory. This can be granted by running ulimit -l unlimited as root before starting Elasticsearch.
Another possible reason why mlockall can fail is that the temporary directory (usually /tmp) is mounted with the noexec option. This can be solved by specifying a new temp directory, by starting Elasticsearch with:
./bin/elasticsearch -Djna.tmpdir=/path/to/new/dir
Warning: mlockall might cause the JVM or shell session to exit if it tries to allocate more memory than is available!
Elasticsearch Settings
The Elasticsearch configuration files can be found under the ES_HOME/config folder. The folder comes with two files: elasticsearch.yml, for configuring the different Elasticsearch modules, and logging.yml, for configuring Elasticsearch logging.
The configuration format is YAML. Here is an example of changing the address all network based modules will use to bind and publish to:
network :
host : 10.0.0.4
Paths
In production use, you will almost certainly want to change paths for data and log files:
path:
logs: /var/log/elasticsearch
data: /var/data/elasticsearch
Cluster name
Also, don’t forget to give your production cluster a name, which is used to discover and auto-join other nodes:
cluster:
name: <NAME OF YOUR CLUSTER>
Make sure that you don’t reuse the same cluster names in different
environments, otherwise you might end up with nodes joining the wrong cluster.
For instance you could use logging-dev, logging-stage, and logging-prod
for the development, staging, and production clusters.
Node name
You may also want to change the default node name for each node to something like the display hostname. By default Elasticsearch will randomly pick a Marvel character name from a list of around 3000 names when your node starts up.
node:
name: <NAME OF YOUR NODE>
The hostname of the machine is provided in the environment
variable HOSTNAME. If on your machine you only run a
single elasticsearch node for that cluster, you can set
the node name to the hostname using the ${...} notation:
node:
name: ${HOSTNAME}
Internally, all settings are collapsed into "namespaced" settings. For example, the above gets collapsed into node.name. This means that it’s easy to support other configuration formats, for example JSON. If JSON is your preferred configuration format, simply rename the elasticsearch.yml file to elasticsearch.json and add:
Configuration styles
{
"network" : {
"host" : "10.0.0.4"
}
}
It also means that it’s easy to provide the settings externally, either using ES_JAVA_OPTS or as parameters to the elasticsearch command, for example:
$ elasticsearch -Des.network.host=10.0.0.4
Another option is to use the es.default. prefix instead of the es. prefix, which means the default setting will be used only if it is not explicitly set in the configuration file.
Another option is to use the ${...} notation within the configuration
file which will resolve to an environment setting, for example:
{
"network" : {
"host" : "${ES_NET_HOST}"
}
}
Additionally, for settings that you do not wish to store in the configuration
file, you can use the value ${prompt.text} or ${prompt.secret} and start
Elasticsearch in the foreground. ${prompt.secret} has echoing disabled so
that the value entered will not be shown in your terminal; ${prompt.text}
will allow you to see the value as you type it in. For example:
node:
name: ${prompt.text}
On execution of the elasticsearch command, you will be prompted to enter
the actual value like so:
Enter value for [node.name]:
Note: Elasticsearch will not start if ${prompt.text} or ${prompt.secret} is used in the settings and the process is run as a service or in the background.
Index Settings
Indices created within the cluster can provide their own settings. For example, the following creates an index with a refresh interval of 5 seconds instead of the default refresh interval (the format can be either YAML or JSON):
$ curl -XPUT http://localhost:9200/kimchy/ -d \
'
index:
refresh_interval: 5s
'
Index level settings can be set on the node level as well, for example,
within the elasticsearch.yml file, the following can be set:
index :
refresh_interval: 5s
This means that every index that gets created on the specific node started with the mentioned configuration will use a refresh interval of 5 seconds unless the index explicitly sets it. In other words, any index level settings override what is set in the node configuration. Of course, the above can also be set as a "collapsed" setting, for example:
$ elasticsearch -Des.index.refresh_interval=5s
All of the index level configuration can be found within each index module.
Logging
Elasticsearch uses an internal logging abstraction and comes, out of the
box, with log4j. It tries to simplify
log4j configuration by using YAML to configure it,
and the logging configuration file is config/logging.yml. The
JSON and
properties formats are also
supported. Multiple configuration files can be loaded, in which case they will
get merged, as long as they start with the logging. prefix and end with one
of the supported suffixes (either .yml, .yaml, .json or .properties).
The logger section contains the java packages and their corresponding log
level, where it is possible to omit the org.elasticsearch prefix. The
appender section contains the destinations for the logs. Extensive information
on how to customize logging and all the supported appenders can be found on
the log4j documentation.
Additional appenders and other logging classes provided by log4j-extras are also available out of the box.
Deprecation logging
In addition to regular logging, Elasticsearch allows you to enable logging
of deprecated actions. For example this allows you to determine early, if
you need to migrate certain functionality in the future. By default,
deprecation logging is disabled. You can enable it in the config/logging.yml
file by setting the deprecation log level to DEBUG.
deprecation: DEBUG, deprecation_log_file
This will create a daily rolling deprecation log file in your log directory. Check this file regularly, especially when you intend to upgrade to a new major version.
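As a minimal sketch of how that line sits inside config/logging.yml (the other entries of the logger section are elided here):

```yaml
logger:
  # log deprecated actions to the dedicated, daily-rolling deprecation log file
  deprecation: DEBUG, deprecation_log_file
```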
8. Running as a Service on Linux
To run Elasticsearch as a service on your operating system, the provided packages try to make it as easy as possible to start and stop Elasticsearch during reboots and upgrades.
Linux
Currently our build automatically creates a Debian package and an RPM package, which are available on the download page. The packages themselves do not have any dependencies, but you have to make sure that you have installed a JDK.
Each package features a configuration file, which allows you to set the following parameters:
| Parameter | Description |
|---|---|
| ES_USER | The user to run as, defaults to … |
| ES_GROUP | The group to run as, defaults to … |
| ES_HEAP_SIZE | The heap size to start with |
| ES_HEAP_NEWSIZE | The size of the new generation heap |
| ES_DIRECT_SIZE | The maximum size of the direct memory |
| MAX_OPEN_FILES | Maximum number of open files, defaults to … |
| MAX_LOCKED_MEMORY | Maximum locked memory size. Set to "unlimited" if you use the bootstrap.mlockall option in elasticsearch.yml. You must also set ES_HEAP_SIZE. |
| MAX_MAP_COUNT | Maximum number of memory map areas a process may have. If you use … |
| LOG_DIR | Log directory, defaults to … |
| DATA_DIR | Data directory, defaults to … |
| CONF_DIR | Configuration file directory (which needs to include …) |
| ES_JAVA_OPTS | Any additional Java options you may want to apply. This may be useful if you need to set the … |
| RESTART_ON_UPGRADE | Configure restart on package upgrade, defaults to … |
| ES_GC_LOG_FILE | The absolute log file path for creating a garbage collection logfile, which is done by the JVM. Note that this logfile can grow pretty quickly and is thus disabled by default. |
Debian/Ubuntu
The Debian package ships with everything you need, as it uses standard Debian tools like update-rc.d to define the runlevels it runs on. The init script is placed at /etc/init.d/elasticsearch, as you would expect, and the configuration file is placed at /etc/default/elasticsearch.
The Debian package does not start the service by default. This is to prevent the instance from accidentally joining a cluster before it has been configured appropriately. After installing with dpkg -i, you can use the following commands to ensure that Elasticsearch starts when the system is booted, and then start it up:
sudo update-rc.d elasticsearch defaults 95 10
sudo /etc/init.d/elasticsearch start
Users running Debian 8 or Ubuntu 14 or later may require configuration of systemd instead of update-rc.d. In those cases, please refer to the Using systemd section.
Installing the Oracle JDK
The usual recommendation is to run the Oracle JDK with Elasticsearch. However, Ubuntu and Debian only ship OpenJDK due to licensing issues. You can easily install the Oracle installer package though. If you are missing the add-apt-repository command under Debian GNU/Linux, make sure you have at least Debian Jessie and the package python-software-properties installed:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
java -version
The last command should verify a successful installation of the Oracle JDK.
RPM based distributions
Using chkconfig
Some RPM-based distributions use chkconfig to enable and disable services. The init script is located at /etc/init.d/elasticsearch, whereas the configuration file is placed at /etc/sysconfig/elasticsearch. Like the Debian package, the RPM package does not start the service by default after installation; you have to do this manually by entering the following commands:
sudo /sbin/chkconfig --add elasticsearch
sudo service elasticsearch start
Using systemd
Distributions like Debian Jessie, Ubuntu 14, and many of the SUSE derivatives do not use the chkconfig tool to register services, but rather systemd and its /bin/systemctl command to start and stop services (at least in newer versions; otherwise use the chkconfig commands above). The configuration file is placed at /etc/sysconfig/elasticsearch if the system is RPM-based, and at /etc/default/elasticsearch if it is deb-based. After installing the RPM, you have to reload the systemd configuration and then start up Elasticsearch:
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service
sudo /bin/systemctl start elasticsearch.service
Also note that changing the MAX_MAP_COUNT setting in /etc/sysconfig/elasticsearch does not have any effect; you will have to change it in /usr/lib/sysctl.d/elasticsearch.conf in order to have it applied at startup.
9. Running as a Service on Windows
Windows users can configure Elasticsearch to run as a service, so that it runs in the background or starts automatically at boot without any user interaction. This can be achieved through the service.bat script under the bin/ folder, which allows one to install, remove, manage, or configure the service, and to start and stop it, all from the command line.
c:\elasticsearch-2.3.0\bin>service
Usage: service.bat install|remove|start|stop|manager [SERVICE_ID]
The script requires one parameter (the command to execute) followed by an optional one indicating the service id (useful when installing multiple Elasticsearch services).
The commands available are:
| Command | Description |
|---|---|
| install | Install Elasticsearch as a service |
| remove | Remove the installed Elasticsearch service (and stop the service if started) |
| start | Start the Elasticsearch service (if installed) |
| stop | Stop the Elasticsearch service (if started) |
| manager | Start a GUI for managing the installed service |
Note that the environment configuration options available during the installation are copied and will be used during the service lifecycle. This means any changes made to them after the installation will not be picked up unless the service is reinstalled.
Based on the architecture of the available JDK/JRE (set through JAVA_HOME), the appropriate 64-bit (x64) or 32-bit (x86) service will be installed. This information is made available during install:
c:\elasticsearch-2.3.0\bin>service install
Installing service : "elasticsearch-service-x64"
Using JAVA_HOME (64-bit): "c:\jvm\jdk1.8"
The service 'elasticsearch-service-x64' has been installed.
Note: While a JRE can be used for the Elasticsearch service, its usage is discouraged and a warning will be issued, because the JRE uses a client VM (as opposed to a server JVM, which offers better performance for long-running applications).
Customizing service settings
There are two ways to customize the service settings:
- Manager GUI
-
Accessible through the manager command, the GUI offers insight into the installed service, including its status, startup type, JVM, and start and stop settings, among other things. Simply invoking service.bat from the command line with the aforementioned option will open up the manager window.
- Customizing service.bat
-
At its core, service.bat relies on the Apache Commons Daemon project to install the service. For full flexibility, such as customizing the user under which the service runs, one can modify the installation parameters accordingly. Do note that this requires reinstalling the service for the new settings to be applied.
Note: There is also a community-supported customizable MSI installer available: https://github.com/salyh/elasticsearch-msi-installer (by Hendrik Saly).
10. Directory Layout
The directory layout of an installation is as follows:
| Type | Description | Default Location | Setting |
|---|---|---|---|
| home | Home of the Elasticsearch installation. | … | … |
| bin | Binary scripts including … | … | … |
| conf | Configuration files including … | … | … |
| data | The location of the data files of each index / shard allocated on the node. Can hold multiple locations. | … | … |
| logs | Log files location. | … | … |
| plugins | Plugin files location. Each plugin will be contained in a subdirectory. | … | … |
| repo | Shared file system repository locations. Can hold multiple locations. A file system repository can be placed into any subdirectory of any directory specified here. | Not configured | … |
| script | Location of script files. | … | … |
Multiple data paths may be specified, in order to spread data across
multiple disks or locations, but all of the files from a single shard will be
written to the same path. This can be configured as follows:
path.data: /mnt/first,/mnt/second
Or in an array format:
path.data: ["/mnt/first", "/mnt/second"]
Note: To stripe shards across multiple disks, please use a RAID driver instead.
Default Paths
Below are the default paths that elasticsearch will use, if not explicitly changed.
deb and rpm
| Type | Description | Location Debian/Ubuntu | Location RHEL/CentOS |
|---|---|---|---|
| home | Home of the Elasticsearch installation. | … | … |
| bin | Binary scripts including … | … | … |
| conf | Configuration files | … | … |
| conf | Environment variables including heap size, file descriptors. | … | … |
| data | The location of the data files of each index / shard allocated on the node. | … | … |
| logs | Log files location | … | … |
| plugins | Plugin files location. Each plugin will be contained in a subdirectory. | … | … |
| repo | Shared file system repository locations. | Not configured | Not configured |
| script | Location of script files. | … | … |
zip and tar.gz
| Type | Description | Location |
|---|---|---|
| home | Home of the Elasticsearch installation | … |
| bin | Binary scripts including … | … |
| conf | Configuration files | … |
| data | The location of the data files of each index / shard allocated on the node | … |
| logs | Log files location | … |
| plugins | Plugin files location. Each plugin will be contained in a subdirectory | … |
| repo | Shared file system repository locations. | Not configured |
| script | Location of script files. | … |
11. Repositories
We also have repositories available for APT and YUM based distributions. Note that we only provide binary packages, but no source packages, as the packages are created as part of the Elasticsearch build.
We have split the major versions into separate URLs to avoid accidental upgrades across major versions. For all 2.x releases use 2.x as the version number, for 3.x.y use 3.x, and so on.
We use the PGP key D88E42B4, Elasticsearch Signing Key, with fingerprint
4609 5ACC 8548 582C 1A26 99A9 D27D 666C D88E 42B4
to sign all our packages. It is available from https://pgp.mit.edu.
APT
Download and install the Public Signing Key:
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
Save the repository definition to /etc/apt/sources.list.d/elasticsearch-2.x.list:
echo "deb http://packages.elastic.co/elasticsearch/2.x/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-2.x.list
Note: Use the echo method shown above to add the repository; do not use add-apt-repository, as it will also add a deb-src entry, and we do not provide source packages. If a deb-src entry has been added, you will see an error like:

Unable to find expected entry 'main/source/Sources' in Release file (Wrong sources.list entry or malformed file)

Just delete the deb-src entry and the installation will work as expected.
Run apt-get update and the repository is ready for use. You can install it with:
sudo apt-get update && sudo apt-get install elasticsearch
Note: If two entries exist for the same Elasticsearch repository, you will see an error like this during apt-get update:

Duplicate sources.list entry http://packages.elastic.co/elasticsearch/2.x/debian/ …

Examine the files in /etc/apt/sources.list.d/ and /etc/apt/sources.list for the duplicate entry and remove it.
Configure Elasticsearch to automatically start during bootup. If your distribution is using SysV init, then you will need to run:
sudo update-rc.d elasticsearch defaults 95 10
Otherwise if your distribution is using systemd:
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service
YUM / DNF
Download and install the public signing key:
rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch
Add the following to a file with a .repo suffix in your /etc/yum.repos.d/ directory, for example elasticsearch.repo:
[elasticsearch-2.x]
name=Elasticsearch repository for 2.x packages
baseurl=https://packages.elastic.co/elasticsearch/2.x/centos
gpgcheck=1
gpgkey=https://packages.elastic.co/GPG-KEY-elasticsearch
enabled=1
And your repository is ready for use. You can install it with:
yum install elasticsearch
Or, for newer versions of Fedora and Redhat:
dnf install elasticsearch
Note: The repositories do not work with older RPM-based distributions that still use RPM v3, such as CentOS 5.

Configure Elasticsearch to automatically start during bootup. If your distribution is using SysV init (check with ps -p 1), then you will need to run:

chkconfig --add elasticsearch
Otherwise if your distribution is using systemd:
sudo /bin/systemctl daemon-reload
sudo /bin/systemctl enable elasticsearch.service
12. Upgrading
Note: Before upgrading Elasticsearch, consult the breaking changes documentation, test the upgrade in a development environment first, and back up your data.

Elasticsearch can usually be upgraded using a rolling upgrade process, resulting in no interruption of service. This section details how to perform both rolling upgrades and upgrades with full cluster restarts.
To determine whether a rolling upgrade is supported for your release, please consult this table:
| Upgrade From | Upgrade To | Supported Upgrade Type |
|---|---|---|
| 0.90.x | 2.x | Full cluster restart |
| 1.x | 2.x | Full cluster restart |
| 2.x | 2.y | Rolling upgrade (where y > x) |
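The table above can be captured as a small decision function. This is an illustrative Python sketch (not part of Elasticsearch), assuming versions are represented as (major, minor) tuples:

```python
def upgrade_type(from_v, to_v):
    """Supported upgrade type between two (major, minor) versions,
    mirroring the table above: 2.x -> 2.y is a rolling upgrade when
    y > x, while 0.90.x -> 2.x and 1.x -> 2.x need a full cluster
    restart."""
    if from_v[0] == 2 and to_v[0] == 2 and to_v[1] > from_v[1]:
        return "rolling upgrade"
    if from_v[0] < 2 and to_v[0] == 2:
        return "full cluster restart"
    return "unsupported"
```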
Upgrading Elasticsearch with Plugins
Take plugins into consideration as well when upgrading. Plugins must be upgraded alongside Elasticsearch.
Check with your plugin’s provider to ensure that the plugin is compatible with
your targeted version of Elasticsearch. If doing a rolling upgrade, it may be
worth checking as well that the plugin works across a mixed-version cluster.
Most plugins will have to be upgraded alongside Elasticsearch, although some
plugins accessed primarily through the browser (_site plugins) may continue to
work given that API changes are compatible.
The process for both rolling upgrades and full cluster restarts is generally as follows, per node:

- Shut down Elasticsearch
- Upgrade Elasticsearch
- Upgrade plugins
- Start up Elasticsearch
12.1. Back Up Your Data!
Always back up your data before performing an upgrade. This will allow you to roll back in the event of a problem. The upgrades sometimes include upgrades to the Lucene libraries used by Elasticsearch to access the index files, and after an index file has been updated to work with a new version of Lucene, it may not be accessible to the versions of Lucene present in earlier Elasticsearch releases.
Note: Always back up your data before upgrading. You cannot roll back to an earlier version unless you have a backup of your data.
12.1.1. Backing up 1.0 and later
To back up a running 1.0 or later system, it is simplest to use the snapshot feature. See the complete instructions for backup and restore with snapshots.
12.1.2. Backing up 0.90.x and earlier
To back up a running 0.90.x system:
Step 1: Disable index flushing
This will prevent indices from being flushed to disk while the backup is in process:
PUT /_all/_settings
{
"index": {
"translog.disable_flush": "true"
}
}
Step 2: Disable reallocation
This will prevent the cluster from moving data files from one node to another while the backup is in process:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "none"
}
}
Step 3: Backup your data
After reallocation and index flushing are disabled, initiate a backup of Elasticsearch’s data path using your favorite backup method (tar, storage array snapshots, backup software).
Step 4: Reenable allocation and flushing
When the backup is complete and data no longer needs to be read from the Elasticsearch data path, allocation and index flushing must be re-enabled:
PUT /_all/_settings
{
"index": {
"translog.disable_flush": "false"
}
}
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}
12.2. Rolling upgrades
A rolling upgrade allows the Elasticsearch cluster to be upgraded one node at a time, with no downtime for end users. Running multiple versions of Elasticsearch in the same cluster for any length of time beyond that required for an upgrade is not supported, as shards will not be replicated from the more recent version to the older version.
Consult this table to verify that rolling upgrades are supported for your version of Elasticsearch.
To perform a rolling upgrade:
12.2.1. Step 1: Disable shard allocation
When you shut down a node, the allocation process will immediately try to replicate the shards that were on that node to other nodes in the cluster, causing a lot of wasted I/O. This can be avoided by disabling allocation before shutting down a node:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "none"
}
}
12.2.2. Step 2: Stop non-essential indexing and perform a synced flush (Optional)
You may happily continue indexing during the upgrade. However, shard recovery will be much faster if you temporarily stop non-essential indexing and issue a synced-flush request:
POST /_flush/synced
A synced flush request is a “best effort” operation. It will fail if there are any pending indexing operations, but it is safe to reissue the request multiple times if necessary.
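Since the operation is best-effort and safe to reissue, a client can wrap it in a simple retry loop. A minimal Python sketch; `synced_flush` here is a hypothetical callable standing in for the actual POST /_flush/synced request:

```python
import time

def retry_synced_flush(synced_flush, attempts=3, delay=1.0):
    """Retry a best-effort synced flush until it succeeds or the
    attempts are exhausted. `synced_flush` is a hypothetical callable
    returning True once all shards are sync-flushed; reissuing the
    underlying request is safe."""
    for i in range(attempts):
        if synced_flush():
            return True
        if i < attempts - 1:
            # give pending indexing operations time to finish
            time.sleep(delay)
    return False
```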
12.2.3. Step 3: Stop and upgrade a single node
Shut down one of the nodes in the cluster before starting the upgrade.
Note: When using the zip or tarball packages, the config, data, logs, and plugins directories are placed within the Elasticsearch home directory by default. It is a good idea to place these directories in a different location so that there is no chance of deleting them when upgrading Elasticsearch. These custom paths can be configured with the path.conf and path.data settings. The Debian and RPM packages place these directories in the appropriate place for each operating system.
To upgrade using a Debian or RPM package:
- Use rpm or dpkg to install the new package. All files should be placed in their proper locations, and config files should not be overwritten.
To upgrade using a zip or compressed tarball:
- Extract the zip or tarball to a new directory, to be sure that you don't overwrite the config or data directories.
- Either copy the files in the config directory from your old installation to your new installation, or use the --path.conf option on the command line to point to an external config directory.
- Either copy the files in the data directory from your old installation to your new installation, or configure the location of the data directory in the config/elasticsearch.yml file with the path.data setting.
12.2.4. Step 4: Start the upgraded node
Start the now upgraded node and confirm that it joins the cluster by checking the log file or by checking the output of this request:
GET _cat/nodes
12.2.5. Step 5: Reenable shard allocation
Once the node has joined the cluster, reenable shard allocation to start using the node:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}
12.2.6. Step 6: Wait for the node to recover
You should wait for the cluster to finish shard allocation before upgrading
the next node. You can check on progress with the _cat/health
request:
GET _cat/health
Wait for the status column to move from yellow to green. Status green
means that all primary and replica shards have been allocated.
Note: During a rolling upgrade, primary shards assigned to a node with the higher version will never have their replicas assigned to a node with the lower version, because the newer version may have a different data format which is not understood by the older version. If it is not possible to assign the replica shards to another node with the higher version (for example, if there is only one node with the higher version in the cluster), then the replica shards will remain unassigned and the cluster health will remain status yellow. In this case, check that there are no initializing or relocating shards before proceeding. As soon as another node is upgraded, the replicas should be assigned and the cluster health will reach status green.
Shards that have not been sync-flushed may take some time to
recover. The recovery status of individual shards can be monitored with the
_cat/recovery request:
GET _cat/recovery
If you stopped indexing, then it is safe to resume indexing as soon as recovery has completed.
12.3. Full cluster restart upgrade
Elasticsearch requires a full cluster restart when upgrading across major versions: from 0.x to 1.x or from 1.x to 2.x. Rolling upgrades are not supported across major versions.
The process to perform an upgrade with a full cluster restart is as follows:
12.3.1. Step 1: Disable shard allocation
When you shut down a node, the allocation process will immediately try to replicate the shards that were on that node to other nodes in the cluster, causing a lot of wasted I/O. This can be avoided by disabling allocation before shutting down a node:
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "none"
}
}
If upgrading from 0.90.x to 1.x, then use these settings instead:
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.disable_allocation": true,
"cluster.routing.allocation.enable": "none"
}
}
12.3.2. Step 2: Perform a synced flush
Shard recovery will be much faster if you stop indexing and issue a synced-flush request:
POST /_flush/synced
A synced flush request is a “best effort” operation. It will fail if there are any pending indexing operations, but it is safe to reissue the request multiple times if necessary.
12.3.3. Step 3: Shutdown and upgrade all nodes
Stop all Elasticsearch services on all nodes in the cluster. Each node can be upgraded following the same procedure described in Step 3: Stop and upgrade a single node.
12.3.4. Step 4: Start the cluster
If you have dedicated master nodes (nodes with node.master set to true, the default, and node.data set to false), then it is a good idea to start them first. Wait for them to form a cluster and to elect a master before proceeding with the data nodes. You can check progress by looking at the logs.
As soon as the minimum number of master-eligible nodes
have discovered each other, they will form a cluster and elect a master. From
that point on, the _cat/health and _cat/nodes
APIs can be used to monitor nodes joining the cluster:
GET _cat/health
GET _cat/nodes
Use these APIs to check that all nodes have successfully joined the cluster.
12.3.5. Step 5: Wait for yellow
As soon as each node has joined the cluster, it will start to recover any
primary shards that are stored locally. Initially, the
_cat/health request will report a status of red, meaning
that not all primary shards have been allocated.
Once each node has recovered its local shards, the status will become
yellow, meaning all primary shards have been recovered, but not all replica
shards are allocated. This is to be expected because allocation is still
disabled.
12.3.6. Step 6: Reenable allocation
Delaying the allocation of replicas until all nodes have joined the cluster allows the master to allocate replicas to nodes which already have local shard copies. At this point, with all the nodes in the cluster, it is safe to reenable shard allocation:
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.enable": "all"
}
}
If upgrading from 0.90.x to 1.x, then use these settings instead:
PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.disable_allocation": false,
"cluster.routing.allocation.enable": "all"
}
}
The cluster will now start allocating replica shards to all data nodes. At this point it is safe to resume indexing and searching, but your cluster will recover more quickly if you can delay indexing and searching until all shards have recovered.
You can monitor progress with the _cat/health and
_cat/recovery APIs:
GET _cat/health
GET _cat/recovery
Once the status column in the _cat/health output has reached green, all
primary and replica shards have been successfully allocated.
Breaking changes
This section discusses the changes that you need to be aware of when migrating your application from one version of Elasticsearch to another.
As a general rule:
- Migration between major versions, e.g. 1.x to 2.x, requires a full cluster restart.
- Migration between minor versions, e.g. 1.x to 1.y, can be performed by upgrading one node at a time.
See Upgrading for more info.
13. Breaking changes in 2.3
This section discusses the changes that you need to be aware of when migrating your application to Elasticsearch 2.3.
13.1. Mappings
13.1.1. Limit to the number of nested fields
Indexing a document with 100 nested fields actually indexes 101 documents as each nested
document is indexed as a separate document. To safeguard against ill-defined mappings
the number of nested fields that can be defined per index has been limited to 50.
This default limit can be changed with the index setting index.mapping.nested_fields.limit.
Note that the limit is only checked when new indices are created or mappings are updated. It
will thus only affect existing pre-2.3 indices if their mapping is changed.
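To see how close a mapping comes to the limit, the nested fields can be counted recursively. An illustrative Python sketch (this helper is not part of Elasticsearch):

```python
def count_nested_fields(properties):
    """Recursively count fields mapped as type "nested" in a mapping's
    properties dict -- the quantity that index.mapping.nested_fields.limit
    (default 50) restricts."""
    count = 0
    for field in properties.values():
        if field.get("type") == "nested":
            count += 1
        count += count_nested_fields(field.get("properties", {}))
    return count
```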
13.2. Scripting
13.2.1. Groovy dependencies
In previous versions of Elasticsearch, the Groovy scripting capabilities
depended on the org.codehaus.groovy:groovy-all artifact. In addition
to pulling in the Groovy language, this pulls in a very large set of
functionality, none of which is needed for scripting within
Elasticsearch. Aside from the inherent difficulties in managing such a
large set of dependencies, this also increases the surface area for
security issues. This dependency has been reduced to the core Groovy
language org.codehaus.groovy:groovy artifact.
14. Breaking changes in 2.2
This section discusses the changes that you need to be aware of when migrating your application to Elasticsearch 2.2.
14.1. Mapping APIs
14.1.1. Geo Point Type
The geo_point format has been changed to reduce index size and the time required to both index and query
geo point data. To make these performance improvements possible both doc_values and coerce are required
and therefore cannot be changed. For this reason the doc_values and coerce parameters have been removed
from the geo_point field mapping.
These new geo-points are not yet supported in the percolator, but see Percolating geo-queries in Elasticsearch 2.2.0 and later for a workaround.
Scripting and security
The Java Security Manager is being used to lock down the privileges available to the scripting languages and to restrict the classes they are allowed to load to a predefined whitelist. These changes may cause scripts which worked in earlier versions to fail. See Scripting and the Java Security Manager for more details.
Field stats API
The field stats response format has been changed for number-based and date fields. The min_value and max_value elements now return values as numbers, and the new min_value_as_string and max_value_as_string elements return the values as strings.
Default logging using systemd
In previous versions of Elasticsearch using systemd, the default logging
configuration routed standard output to /dev/null and standard error to
the journal. However, there are often critical error messages at
startup that are logged to standard output rather than standard error
and these error messages would be lost to the ether. The default has
changed to now route standard output to the journal and standard error
to inherit this setting (these are the defaults for systemd). These
settings can be modified by editing the elasticsearch.service file.
Java Client
Previously it was possible to iterate over ClusterHealthResponse to get information about ClusterIndexHealth.
While this is still possible, it requires now iterating over the values returned from getIndices():
ClusterHealthResponse clusterHealthResponse = client.admin().cluster().prepareHealth().get();
for (Map.Entry<String, ClusterIndexHealth> index : clusterHealthResponse.getIndices().entrySet()) {
String indexName = index.getKey();
ClusterIndexHealth health = index.getValue();
}
Also ClusterHealthStatus has been moved from org.elasticsearch.action.admin.cluster.health to org.elasticsearch.cluster.health.
Cloud AWS Plugin
Proxy settings have been deprecated and renamed:
- from cloud.aws.proxy_host to cloud.aws.proxy.host
- from cloud.aws.ec2.proxy_host to cloud.aws.ec2.proxy.host
- from cloud.aws.s3.proxy_host to cloud.aws.s3.proxy.host
- from cloud.aws.proxy_port to cloud.aws.proxy.port
- from cloud.aws.ec2.proxy_port to cloud.aws.ec2.proxy.port
- from cloud.aws.s3.proxy_port to cloud.aws.s3.proxy.port
If you are using proxy settings, update them now, as the deprecated ones will be removed in the next major version.
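Because each deprecated key maps one-to-one onto its replacement, a settings dictionary can be migrated mechanically. An illustrative Python sketch (not part of the plugin):

```python
def migrate_proxy_settings(settings):
    """Rewrite the six deprecated cloud-aws proxy keys listed above to
    their new names (e.g. cloud.aws.proxy_host -> cloud.aws.proxy.host);
    all other keys pass through unchanged."""
    renamed = {
        prefix + old: prefix + new
        for prefix in ("cloud.aws.", "cloud.aws.ec2.", "cloud.aws.s3.")
        for old, new in (("proxy_host", "proxy.host"),
                         ("proxy_port", "proxy.port"))
    }
    return {renamed.get(key, key): value for key, value in settings.items()}
```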
Multicast plugin deprecated
The discovery-multicast plugin has been deprecated in 2.2.0 and has
been removed in 3.0.0.
15. Breaking changes in 2.1
This section discusses the changes that you need to be aware of when migrating your application to Elasticsearch 2.1.
15.1. Search changes
15.1.1. search_type=scan deprecated
The scan search type has been deprecated. All benefits from this search
type can now be achieved by doing a scroll request that sorts documents in
_doc order, for instance:
GET /my_index/_search?scroll=2m
{
"sort": [
"_doc"
]
}
Scroll requests sorted by _doc have been optimized to more efficiently resume
from where the previous request stopped, so this will have the same performance
characteristics as the former scan search type.
15.1.2. search_type=count deprecated
The count search type has been deprecated. All benefits from this search
type can now be achieved by setting size to 0, for instance:
GET /my_index/_search
{
"aggs": {...},
"size": 0
}
15.1.3. from + size limits
Elasticsearch will now return an error message if a query's from + size is more than the index.max_result_window parameter. This parameter defaults to 10,000, which is safe for almost all clusters. Values higher than that can consume significant chunks of heap memory per search and per shard executing the search. It is safest to leave this value as it is and use the scroll API for any deep scrolling, but the setting is dynamic, so it can be raised or lowered as needed.
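The reason deep paging is expensive: every shard must collect its own top from + size candidates, and the coordinating node must receive and sort all of them. A back-of-the-envelope Python sketch of that arithmetic (illustrative only):

```python
def deep_paging_cost(from_, size, shards):
    """Documents each shard must collect, and the total the coordinating
    node must receive and sort, for a from + size request. Shows why
    index.max_result_window defaults to 10,000."""
    per_shard = from_ + size
    coordinating = per_shard * shards
    return per_shard, coordinating

# A page at offset 10,000 on a 5-shard index makes the coordinating
# node sort 50,050 entries just to return 10 documents.
```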
15.2. Update changes
15.2.1. Updates now detect_noop by default
We’ve switched the default value of the detect_noop option from false to
true. This means that Elasticsearch will ignore updates that don’t change
source unless you explicitly set "detect_noop": false. detect_noop was
always computationally cheap compared to the expense of the update which can be
thought of as a delete operation followed by an index operation.
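Conceptually, a noop is detected when merging the partial document into the existing source would change nothing. A simplified Python sketch of that check (the real implementation compares the full merged source):

```python
def is_noop_update(existing_source, partial_doc):
    """True if applying `partial_doc` to `existing_source` would leave
    the document unchanged, in which case the expensive
    delete-then-reindex cycle can be skipped."""
    return all(existing_source.get(key) == value
               for key, value in partial_doc.items())
```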
15.3. Index APIs
15.4. Removed features
15.4.1. indices.fielddata.cache.expire
The experimental feature indices.fielddata.cache.expire has been removed.
For indices that have this setting configured, this config will be ignored.
15.4.2. Forbid changing of thread pool types
Previously, thread pool types could be dynamically adjusted. The thread pool type effectively
controls the backing queue for the thread pool and modifying this is an expert setting with minimal practical benefits
and high risk of being misused. The ability to change the thread pool type for any thread pool has been removed; do note
that it is still possible to adjust relevant thread pool parameters for each of the thread pools (e.g., depending on
the thread pool type, keep_alive, queue_size, etc.).
16. Breaking changes in 2.0
This section discusses the changes that you need to be aware of when migrating your application to Elasticsearch 2.0.
Indices created before 0.90
Elasticsearch 2.0 can read indices created in version 0.90 and above. If any of your indices were created before 0.90 you will need to upgrade to the latest 1.x version of Elasticsearch first, in order to upgrade your indices or to delete the old indices. Elasticsearch will not start in the presence of old indices.
Elasticsearch migration plugin
We have provided the Elasticsearch migration plugin to help you detect any issues that you may have when upgrading to Elasticsearch 2.0. Please install and run the plugin before upgrading.
16.1. Removed features
16.1.1. Rivers have been removed
Elasticsearch does not support rivers anymore. While we had first planned to keep them around to ease migration, keeping support for rivers proved to be challenging as it conflicted with other important changes that we wanted to bring to 2.0 like synchronous dynamic mappings updates, so we eventually decided to remove them entirely. See Deprecating Rivers for more background about why we took this decision.
16.1.2. Facets have been removed
Facets, deprecated since 1.0, have now been removed. Instead, use the much more powerful and flexible aggregations framework. This also means that Kibana 3 will not work with Elasticsearch 2.0.
16.1.3. MVEL has been removed
The MVEL scripting language has been removed. The default scripting language is now Groovy.
16.1.4. Delete-by-query is now a plugin
The old delete-by-query functionality was fast but unsafe. It could lead to document differences between the primary and replica shards, and could even produce out of memory exceptions and cause the cluster to crash.
This feature has been reimplemented using the scroll and
bulk APIs, which may be slower for queries which match
large numbers of documents, but is safe.
Currently, a long running delete-by-query job cannot be cancelled, which is one of the reasons that this functionality is only available as a plugin. You can install the plugin with:
./bin/plugin install delete-by-query
See https://www.elastic.co/guide/en/elasticsearch/plugins/2.3/plugins-delete-by-query.html for more information.
16.1.5. Multicast Discovery is now a plugin
Support for multicast is very patchy. Linux doesn't allow multicast listening on localhost, while OS X sends multicast broadcasts across all interfaces regardless of the configured bind address. On top of that, some networks have multicast disabled by default.
This feature has been moved to a plugin. The default discovery mechanism now uses unicast, with a default setup which looks for the first 5 ports on localhost. If you still need to use multicast discovery, you can install the plugin with:
./bin/plugin install discovery-multicast
See https://www.elastic.co/guide/en/elasticsearch/plugins/2.3/discovery-multicast.html for more information.
16.1.6. _shutdown API
The _shutdown API has been removed without a replacement. Nodes should be
managed via the operating system and the provided start/stop scripts.
16.1.7. murmur3 is now a plugin
The murmur3 field, which indexes hashes of the field values, has been moved
out of core and is available as a plugin. It can be installed as:
./bin/plugin install mapper-murmur3
16.1.8. _size is now a plugin
The _size meta-data field, which indexes the size in bytes of the original
JSON document, has been moved out of core and is available as a plugin. It
can be installed as:
./bin/plugin install mapper-size
16.1.9. Thrift and memcached transport
The thrift and memcached transport plugins are no longer supported. Instead, use either the HTTP transport (enabled by default) or the node or transport Java client.
16.1.10. Bulk UDP
The bulk UDP API has been removed. Instead, use the standard
bulk API, or use UDP to send documents to Logstash first.
16.2. Network changes
16.2.1. Bind to localhost
Elasticsearch 2.x will only bind to localhost by default. It will try to bind
to both 127.0.0.1 (IPv4) and [::1] (IPv6), but will work happily in
environments where only IPv4 or IPv6 is available. This change prevents
Elasticsearch from trying to connect to other nodes on your network unless you
specifically tell it to do so. When moving to production you should configure
the network.host parameter, either in the elasticsearch.yml config file or
on the command line:
bin/elasticsearch --network.host 192.168.1.5
bin/elasticsearch --network.host _non_loopback_
The full list of options that network.host accepts can be found in the Network Settings.
16.2.2. Multicast removed
Multicast has been removed (although it is still
provided as a plugin for now). Instead,
and only when bound to localhost, Elasticsearch will use unicast to contact
the first 5 ports in the transport.tcp.port range, which defaults to
9300-9400.
This preserves the zero-config auto-clustering experience for the developer, but it means that you will have to provide a list of unicast hosts when moving to production, for instance:
discovery.zen.ping.unicast.hosts: [ 192.168.1.2, 192.168.1.3 ]
You don’t need to list all of the nodes in your cluster as unicast hosts, but you should specify at least a quorum (majority) of master-eligible nodes. A big cluster will typically have three dedicated master nodes, in which case we recommend listing all three of them as unicast hosts.
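The quorum mentioned above is a simple majority of the master-eligible nodes. A one-line Python sketch of the arithmetic (illustrative; the same calculation underlies the usual minimum_master_nodes recommendation):

```python
def quorum(master_eligible):
    """Smallest majority of `master_eligible` nodes: the minimum number
    of unicast hosts worth listing, per the guidance above."""
    return master_eligible // 2 + 1
```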
16.3. Multiple path.data striping
Previously, if the path.data setting listed multiple data paths, then a
shard would be “striped” across all paths by writing a whole file to each
path in turn (in accordance with the index.store.distributor setting). The
result was that files from a single segment in a shard could be spread across
multiple disks, and the failure of any one disk could corrupt multiple shards.
This striping is no longer supported. Instead, different shards may be allocated to different paths, but all of the files in a single shard will be written to the same path.
If striping is detected while starting Elasticsearch 2.0.0 or later, all of the files belonging to the same shard will be migrated to the same path. If there is not enough disk space to complete this migration, the upgrade will be cancelled and can only be resumed once enough disk space is made available.
The index.store.distributor setting has also been removed.
16.4. Mapping changes
A number of changes have been made to mappings to remove ambiguity and to ensure that conflicting mappings cannot be created.
One major change is that dynamically added fields must have their mapping confirmed by the master node before indexing continues. This is to avoid a problem where different shards in the same index dynamically add different mappings for the same field. These conflicting mappings can silently return incorrect results and can lead to index corruption.
This change can make indexing slower when frequently adding many new fields. We are looking at ways of optimising this process but we chose safety over performance for this extreme use case.
16.4.1. Conflicting field mappings
Fields with the same name, in the same index, in different types, must have
the same mapping, with the exception of the copy_to, dynamic,
enabled, ignore_above, include_in_all, and properties
parameters, which may have different settings per field.
PUT my_index
{
"mappings": {
"type_one": {
"properties": {
"name": {
"type": "string"
}
}
},
"type_two": {
"properties": {
"name": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
The two name fields have conflicting mappings and will prevent Elasticsearch from starting.
Elasticsearch will not start in the presence of conflicting field mappings. These indices must be deleted or reindexed using a new mapping.
The ignore_conflicts option of the put mappings API has been removed.
Conflicts can’t be ignored anymore.
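The rule can be checked mechanically: for every field name, compare its parameters across types, ignoring the exempt ones. An illustrative Python sketch (not Elasticsearch's actual implementation; it only inspects top-level properties):

```python
# Parameters allowed to differ between same-named fields in different types.
EXEMPT = {"copy_to", "dynamic", "enabled", "ignore_above",
          "include_in_all", "properties"}

def find_conflicts(mappings):
    """Return the field names whose non-exempt parameters differ across
    the types of an index -- the condition 2.0 rejects."""
    seen, conflicts = {}, set()
    for type_mapping in mappings.values():
        for name, params in type_mapping.get("properties", {}).items():
            core = {k: v for k, v in params.items() if k not in EXEMPT}
            if name in seen and seen[name] != core:
                conflicts.add(name)
            seen.setdefault(name, core)
    return conflicts
```

Applied to the example above, the two incompatible name fields are flagged.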
16.4.2. Fields cannot be referenced by short name
A field can no longer be referenced using its short name. Instead, the full path to the field is required. For instance:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"title": { "type": "string" },
"name": {
"properties": {
"title": { "type": "string" },
"first": { "type": "string" },
"last": { "type": "string" }
}
}
}
}
}
}
In this example, the top-level field is referred to as title, while the nested field is referred to as name.title. Previously, these two title fields could have been confused with each other when using the short name title.
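The full path of every field can be derived by walking the properties tree. An illustrative Python sketch (not part of Elasticsearch):

```python
def full_paths(properties, prefix=""):
    """List each leaf field in a mapping by its full path, the only form
    of reference accepted from 2.0 onwards (e.g. name.title rather than
    the short name title)."""
    paths = []
    for name, field in properties.items():
        path = prefix + name
        if "type" in field:  # a concrete (leaf) field
            paths.append(path)
        paths.extend(full_paths(field.get("properties", {}), path + "."))
    return paths
```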
16.4.3. Type name prefix removed
Previously, two fields with the same name in two different types could sometimes be disambiguated by prepending the type name. As a side effect, it would add a filter on the type name to the relevant query. This feature was ambiguous — a type name could be confused with a field name — and didn’t work everywhere e.g. aggregations.
Instead, fields should be specified with the full path, but without a type
name prefix. If you wish to filter by the _type field, either specify the
type in the URL or add an explicit filter.
The following example query in 1.x:
GET my_index/_search
{
"query": {
"match": {
"my_type.some_field": "quick brown fox"
}
}
}
would be rewritten in 2.0 as:
GET my_index/my_type/_search
{
"query": {
"match": {
"some_field": "quick brown fox"
}
}
}
In the rewritten query, the type name is specified in the URL to act as a filter, and the field name is specified without the type prefix.
16.4.4. Field names may not contain dots
In 1.x, it was possible to create fields with dots in their name, for instance:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"foo.bar": {
"type": "string"
},
"foo": {
"properties": {
"bar": {
"type": "string"
}
}
}
}
}
}
}
These two fields cannot be distinguished, as both are referred to as foo.bar.
You can no longer create fields with dots in the name.
16.4.5. Type names may not start with a dot
In 1.x, Elasticsearch would issue a warning if a type name included a dot,
e.g. my.type. Now that type names are no longer used to distinguish between
fields in different types, this warning has been relaxed: type names may now
contain dots, but they may not begin with a dot. The only exception to this
is the special .percolator type.
16.4.6. Type names may not be longer than 255 characters
Mapping type names may not be longer than 255 characters. Long type names will continue to function on indices created before the upgrade, but it will not be possible to create types with long names in new indices.
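The last few restrictions combine into a simple validity check. An illustrative Python sketch (not part of Elasticsearch):

```python
def valid_type_name(name):
    """Apply the 2.0 type-name rules described above: at most 255
    characters, and no leading dot except for the special .percolator
    type (dots elsewhere in the name are allowed)."""
    if len(name) > 255:
        return False
    if name.startswith(".") and name != ".percolator":
        return False
    return True
```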
16.4.7. Types may no longer be deleted
In 1.x it was possible to delete a type mapping, along with all of the documents of that type, using the delete mapping API. This is no longer supported, because remnants of the fields in the type could remain in the index, causing corruption later on.
Instead, if you need to delete a type mapping, you should reindex to a new index which does not contain the mapping. If you just need to delete the documents that belong to that type, then use the delete-by-query plugin instead.
16.4.8. Type meta-fields
The meta-fields associated with each type have had configuration options removed, to make them more reliable:

- _id configuration can no longer be changed. If you need to sort, use the _uid field instead.
- _type configuration can no longer be changed.
- _index configuration can no longer be changed.
- _routing configuration is limited to marking routing as required.
- _field_names configuration is limited to disabling the field.
- _size configuration is limited to enabling the field.
- _timestamp configuration is limited to enabling the field and setting a format and default value.
- _boost has been removed.
- _analyzer has been removed.
Importantly, meta-fields can no longer be specified as part of the document
body. Instead, they must be specified in the query string parameters. For
instance, in 1.x, the routing could be specified as follows:
PUT my_index
{
"mappings": {
"my_type": {
"_routing": {
"path": "group"
},
"properties": {
"group": {
"type": "string"
}
}
}
}
}
PUT my_index/my_type/1
{
"group": "foo"
}
This 1.x mapping tells Elasticsearch to extract the routing value from the group field in the document body, so the indexing request above uses a routing value of foo.
In 2.0, the routing must be specified explicitly:
PUT my_index
{
"mappings": {
"my_type": {
"_routing": {
"required": true
},
"properties": {
"group": {
"type": "string"
}
}
}
}
}
PUT my_index/my_type/1?routing=bar
{
"group": "foo"
}
Routing can be marked as required to ensure it is not forgotten during indexing. The indexing request above uses a routing value of bar.
16.4.9. _timestamp and _ttl deprecated
The _timestamp and _ttl fields are deprecated, but will remain functional
for the remainder of the 2.x series.
Instead of the _timestamp field, use a normal date field and set
the value explicitly.
The current _ttl functionality will be replaced in a future version with a
new implementation of TTL, possibly with different semantics, and will not
depend on the _timestamp field.
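For instance, a minimal sketch of replacing _timestamp with an ordinary date field (the field name created_at is illustrative):

```
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "created_at": { "type": "date" }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "created_at": "2015-06-15T12:00:00Z"
}
```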
16.4.10. Analyzer mappings
Previously, index_analyzer and search_analyzer could be set separately,
while the analyzer setting would set both. The index_analyzer setting has
been removed in favour of just using the analyzer setting.
If just the analyzer is set, it will be used at index time and at search time. To use a different analyzer at search time, specify both the analyzer and a search_analyzer.
The index_analyzer, search_analyzer, and analyzer type-level settings
have also been removed, as it is no longer possible to select fields based on
the type name.
The _analyzer meta-field, which allowed setting an analyzer per document, has also been removed. It will be ignored on older indices.
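As an illustrative sketch, a field that uses one analyzer at index time and a different one at search time might be mapped as follows (the analyzer choices here are examples only):

```
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "standard",
          "search_analyzer": "whitespace"
        }
      }
    }
  }
}
```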
16.4.11. Date fields and Unix timestamps
Previously, date fields would first try to parse values as a Unix timestamp — milliseconds-since-the-epoch — before trying to use their defined date
format. This meant that formats like yyyyMMdd could never work, as values
would be interpreted as timestamps.
In 2.0, we have added two formats: epoch_millis and epoch_second. Only
date fields that use these formats will be able to parse timestamps.
These formats cannot be used in dynamic templates, because they are indistinguishable from long values.
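For example, a date field that should accept both yyyyMMdd values and Unix timestamps in seconds could combine the two formats explicitly (a sketch, with an illustrative field name):

```
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "created": {
          "type": "date",
          "format": "yyyyMMdd||epoch_second"
        }
      }
    }
  }
}
```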
16.4.12. Default date format
The default date format has changed from date_optional_time to
strict_date_optional_time, which expects a 4-digit year and a 2-digit month
and day (and, optionally, a 2-digit hour, minute, and second).
A dynamically added date field, by default, includes the epoch_millis
format to support timestamp parsing. For instance:
PUT my_index/my_type/1
{
"date_one": "2015-01-01"
}
The date_one field above is given the format "strict_date_optional_time||epoch_millis".
16.4.13. mapping.date.round_ceil setting
The mapping.date.round_ceil setting for date math parsing has been removed.
16.4.14. Boolean fields
Boolean fields used to have a string fielddata with F meaning false and T
meaning true. They have been refactored to use numeric fielddata, with 0
for false and 1 for true. As a consequence, the format of the responses of
several APIs changed when applied to boolean fields: 0/1 is returned
instead of F/T.
In addition, terms aggregations use a custom formatter for boolean (like for
dates and ip addresses, which are also backed by numbers) in order to return
the user-friendly representation of boolean fields: false/true:
"buckets": [
{
"key": 0,
"key_as_string": "false",
"doc_count": 42
},
{
"key": 1,
"key_as_string": "true",
"doc_count": 12
}
]
16.4.15. index_name and path removed
The index_name setting was used to change the name of the Lucene field,
and the path setting was used on object fields to determine whether the
Lucene field should use the full path (including parent object fields), or
just the final name.
These settings have been removed, as their purpose is better served by the
copy_to parameter.
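A sketch of combining several fields into one Lucene field with copy_to, which covers the main use case of the removed settings (field names are illustrative):

```
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "first_name": {
          "type": "string",
          "copy_to": "full_name"
        },
        "last_name": {
          "type": "string",
          "copy_to": "full_name"
        },
        "full_name": { "type": "string" }
      }
    }
  }
}
```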
16.4.16. Murmur3 Fields
Fields of type murmur3 can no longer change doc_values or index setting.
They are always mapped as follows:
{
"type": "murmur3",
"index": "no",
"doc_values": true
}
16.4.17. Mappings in config files not supported
The ability to specify mappings in configuration files has been removed. To specify default mappings that apply to multiple indexes, use index templates instead.
Along with this change, the following settings have been removed:
- index.mapper.default_mapping_location
- index.mapper.default_percolator_mapping_location
16.4.18. Fielddata formats
Now that doc values are the default for fielddata, specialized in-memory formats have become an esoteric option. These fielddata formats have been removed:
- fst on string fields
- compressed on geo points
The default fielddata format will be used instead.
16.4.19. Posting and doc-values codecs
It is no longer possible to specify per-field postings and doc values formats in the mappings. This setting will be ignored on indices created before 2.0 and will cause mapping parsing to fail on indices created on or after 2.0. For old indices, this means that new segments will be written with the default postings and doc values formats of the current codec.
It is still possible to change the whole codec by using the index.codec
setting. Please however note that using a non-default codec is discouraged as
it could prevent future versions of Elasticsearch from being able to read the
index.
16.4.20. Compress and compress threshold
The compress and compress_threshold options have been removed from the
_source field and fields of type binary. These fields are compressed by
default. If you would like to increase compression levels, use the new
index.codec: best_compression setting instead.
16.4.21. position_offset_gap
The position_offset_gap option has been renamed to position_increment_gap to
clear up confusion: Elasticsearch's position_increment_gap now maps directly
to Lucene's position_increment_gap.
The default position_increment_gap is now 100. Indexes created in Elasticsearch
2.0.0 will default to using 100 and indexes created before that will continue
to use the old default of 0. This was done to prevent phrase queries from
matching across different values of the same term unexpectedly. Specifically,
100 was chosen to cause phrase queries with slops up to 99 to match only within
a single value of a field.
16.4.22. copy_to and multi fields
A copy_to within a multi field is ignored from version 2.0 onwards. In any version after 2.0.1 or 2.1, creating a mapping that has a copy_to within a multi field will result in an exception.
16.5. CRUD and routing changes
16.5.1. Explicit custom routing
Custom routing values can no longer be extracted from the document body, but
must be specified explicitly as part of the query string, or in the metadata
line in the bulk API. See Type meta-fields for an
example.
16.5.2. Routing hash function
The default hash function that is used for routing has been changed from
djb2 to murmur3. This change should be transparent unless you relied on
very specific properties of djb2. This will help ensure a better balance of
the document counts between shards.
In addition, the following routing-related node settings have been deprecated:
- cluster.routing.operation.hash.type: an undocumented setting that allowed configuring which hash function to use for routing. murmur3 is now enforced on new indices.
- cluster.routing.operation.use_type: an undocumented setting that allowed taking the _type of the document into account when computing its shard (default: false). false is now enforced on new indices.
16.5.3. Delete API with custom routing
The delete API used to be broadcast to all shards in the index which meant
that, when using custom routing, the routing parameter was optional. Now,
the delete request is forwarded only to the shard holding the document. If you
are using custom routing then you should specify the routing value when
deleting a document, just as is already required for the index, create,
and update APIs.
To make sure that you never forget a routing value, make routing required with the following mapping:
PUT my_index
{
"mappings": {
"my_type": {
"_routing": {
"required": true
}
}
}
}
16.5.4. All stored meta-fields returned by default
Previously, meta-fields like _routing, _timestamp, etc would only be
included in a GET request if specifically requested with the fields
parameter. Now, all meta-fields which have stored values will be returned by
default. Additionally, they are now returned at the top level (along with
_index, _type, and _id) instead of in the fields element.
For instance, the following request:
GET /my_index/my_type/1
might return:
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_timestamp": 10000000,
"_source": {
"foo" : [ "bar" ]
}
}
The _timestamp is returned by default, and at the top level.
16.5.5. Async replication
The replication parameter has been removed from all CRUD operations
(index, create, update, delete, bulk) as it interfered with the
synced flush feature. These operations are now
synchronous only and a request will only return once the changes have been
replicated to all active shards in the shard group.
Instead, use more client processes to send more requests in parallel.
16.5.6. Documents must be specified without a type wrapper
Previously, the document body could be wrapped in another object with the name
of the type:
PUT my_index/my_type/1
{
"my_type": {
"text": "quick brown fox"
}
}
This my_type wrapper is not part of the document itself, but represents the document type. This feature was deprecated previously, but could be re-enabled with the mapping.allow_type_wrapper index setting. That setting is no longer supported. The above document should be indexed as follows:
PUT my_index/my_type/1
{
"text": "quick brown fox"
}
16.6. Query DSL changes
16.6.1. Queries and filters merged
Queries and filters have been merged — all filter clauses are now query clauses. Instead, query clauses can now be used in query context or in filter context:
- Query context: A query used in query context will calculate relevance scores and will not be cacheable. Query context is used whenever filter context does not apply.
- Filter context: A query used in filter context will not calculate relevance scores, and will be cacheable. Filter context is introduced by:
  - the constant_score query
  - the must_not and (newly added) filter parameters in the bool query
  - the filter and filters parameters in the function_score query
  - any API called filter, such as the post_filter search parameter, or filters in aggregations or index aliases
16.6.2. terms query and filter
The execution option of the terms filter is now deprecated and is ignored
if provided. Similarly, the terms query no longer supports the
minimum_should_match parameter.
16.6.3. or and and now implemented via bool
The or and and filters previously had a different execution pattern to the
bool filter. It used to be important to use and/or with certain filter
clauses, and bool with others.
This distinction has been removed: the bool query is now smart enough to
handle both cases optimally. As a result of this change, the or and and
filters are now sugar syntax which are executed internally as a bool query.
These filters may be removed in the future.
16.6.4. filtered query and query filter deprecated
The query filter is deprecated as it is no longer needed: all queries can
be used in query or filter context.
The filtered query is deprecated in favour of the bool query. Instead of
the following:
GET _search
{
"query": {
"filtered": {
"query": {
"match": {
"text": "quick brown fox"
}
},
"filter": {
"term": {
"status": "published"
}
}
}
}
}
move the query and filter to the must and filter parameters in the bool
query:
GET _search
{
"query": {
"bool": {
"must": {
"match": {
"text": "quick brown fox"
}
},
"filter": {
"term": {
"status": "published"
}
}
}
}
}
16.6.5. Filter auto-caching
It used to be possible to control which filters were cached with the _cache
option and to provide a custom _cache_key. These options are deprecated
and, if present, will be ignored.
Query clauses used in filter context are now auto-cached when it makes sense to do so. The algorithm takes into account the frequency of use, the cost of query execution, and the cost of building the filter.
The terms filter lookup mechanism no longer caches the values of the
document containing the terms. It relies on the filesystem cache instead. If
the lookup index is not too large, it is recommended to replicate it to all
nodes by setting index.auto_expand_replicas: 0-all in order to remove the
network overhead as well.
16.6.6. Numeric queries use IDF for scoring
Previously, term queries on numeric fields were deliberately prevented from using the usual Lucene scoring logic and this behaviour was undocumented and, to some, unexpected.
Single term queries on numeric fields now score in the same way as string
fields, using IDF and norms (if enabled).
To query numeric fields without scoring, the query clause should be used in
filter context, e.g. in the filter parameter of the bool query, or wrapped
in a constant_score query:
GET _search
{
"query": {
"bool": {
"must": [
{
"match": {
"numeric_tag": 5
}
}
],
"filter": [
{
"match": {
"count": 5
}
}
]
}
}
}
The clause in the must array includes IDF in the relevance score calculation, while the clause in the filter array has no effect on the relevance score.
16.6.7. Fuzziness and fuzzy-like-this
Fuzzy matching used to calculate the score for each fuzzy alternative, meaning that rare misspellings would have a higher score than the more common correct spellings. Now, fuzzy matching blends the scores of all the fuzzy alternatives to use the IDF of the most frequently occurring alternative.
Fuzziness can no longer be specified using a percentage, but should instead use the number of allowed edits:
- 0, 1, or 2
- AUTO (which chooses 0, 1, or 2 based on the length of the term)
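For example, a match query using edit-distance-based fuzziness might look like this (a sketch with illustrative field and query values):

```
GET _search
{
  "query": {
    "match": {
      "text": {
        "query": "quick brwon fox",
        "fuzziness": "AUTO"
      }
    }
  }
}
```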
The fuzzy_like_this and fuzzy_like_this_field queries used a very
expensive approach to fuzzy matching and have been removed.
16.6.8. More Like This
The More Like This (mlt) API and the more_like_this_field (mlt_field)
query have been removed in favor of the
more_like_this query.
The parameter percent_terms_to_match has been removed in favor of
minimum_should_match.
16.6.9. limit filter deprecated
The limit filter is deprecated and becomes a no-op. You can achieve similar
behaviour using the terminate_after parameter.
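As a sketch, stopping collection on each shard after the first 100 matching documents with terminate_after might look like this:

```
GET _search
{
  "terminate_after": 100,
  "query": {
    "match_all": {}
  }
}
```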
16.7. Search changes
16.7.1. Partial fields
Partial fields have been removed in favor of source filtering.
16.7.2. search_type=count deprecated
The count search type has been deprecated. All benefits from this search
type can now be achieved by using the (default) query_then_fetch search type
and setting size to 0.
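For example, the former search_type=count pattern of running aggregations without fetching hits can be expressed as follows (the aggregation shown is illustrative):

```
GET _search
{
  "size": 0,
  "aggs": {
    "statuses": {
      "terms": { "field": "status" }
    }
  }
}
```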
16.7.3. The count api internally uses the search api
The count api is now a shortcut to the search api with size set to 0. As a
result, a total failure will result in an exception being returned rather
than a normal response with count set to 0 and shard failures.
16.7.4. All stored meta-fields returned by default
Previously, meta-fields like _routing, _timestamp, etc would only be
included in the search results if specifically requested with the fields
parameter. Now, all meta-fields which have stored values will be returned by
default. Additionally, they are now returned at the top level (along with
_index, _type, and _id) instead of in the fields element.
For instance, the following request:
GET /my_index/_search?fields=foo
might return:
{
[...]
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_score": 1,
"_timestamp": 10000000,
"fields": {
"foo" : [ "bar" ]
}
}
]
}
}
The _timestamp is returned by default, and at the top level.
16.7.5. Script fields
Script fields in 1.x were only returned as a single value. Even if the return value of a script was a list, it would be returned as an array containing an array:
"fields": {
"my_field": [
[
"v1",
"v2"
]
]
}
In Elasticsearch 2.0, scripts that return a list of values are treated as multi-valued fields. The same example would return the following response, with values in a single array.
"fields": {
"my_field": [
"v1",
"v2"
]
}
16.7.6. Timezone for date field
Specifying the time_zone parameter in queries or aggregations on fields of
type date must now be either an ISO 8601 UTC offset, or a timezone id. For
example, the value +1:00 must now be written as +01:00.
16.7.7. Only highlight queried fields
The default value for the require_field_match option has changed from
false to true, meaning that the highlighters will, by default, only take
the fields that were queried into account.
This means that, when querying the _all field, trying to highlight on any
field other than _all will produce no highlighted snippets. Querying the
same fields that need to be highlighted is the cleaner solution to get
highlighted snippets back. Otherwise, the require_field_match option can be
set to false to ignore field names completely when highlighting.
The postings highlighter no longer supports the require_field_match option;
it will only highlight fields that were queried.
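A sketch of restoring the old behaviour by disabling require_field_match (field names are illustrative):

```
GET _search
{
  "query": {
    "match": { "_all": "quick brown fox" }
  },
  "highlight": {
    "require_field_match": false,
    "fields": {
      "text": {}
    }
  }
}
```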
16.8. Aggregation changes
16.8.1. Min doc count defaults to zero
Both the histogram and date_histogram aggregations now have a default
min_doc_count of 0 instead of 1.
16.8.2. Timezone for date field
Specifying the time_zone parameter in queries or aggregations on fields of
type date must now be either an ISO 8601 UTC offset, or a timezone id. For
example, the value +1:00 must now be written as +01:00.
16.8.3. Time zones and offsets
The histogram and the date_histogram aggregation now support a simplified
offset option that replaces the previous pre_offset and post_offset
rounding options. Instead of having to specify two separate offset shifts of
the underlying buckets, the offset option moves the bucket boundaries in
positive or negative direction depending on its argument.
The date_histogram options for pre_zone and post_zone are replaced by
the time_zone option. The behavior of time_zone is equivalent to the
former pre_zone option. Setting time_zone to a value like "+01:00" now
will lead to the bucket calculations being applied in the specified time zone.
The key is returned as the timestamp in UTC, but the key_as_string is
returned in the time zone specified.
In addition to this, the pre_zone_adjust_large_interval is removed because
we now always return dates and bucket keys in UTC.
16.8.4. Including/excluding terms
include/exclude filtering on the terms aggregation now uses the same
syntax as regexp queries instead of the Java regular
expression syntax. While simple regexps should still work, more complex ones
might need some rewriting. Also, the flags parameter is no longer supported.
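For example, a terms aggregation restricted to keys matching a regexp-query-style pattern might look like this (a sketch; the field and pattern are illustrative):

```
GET _search
{
  "size": 0,
  "aggs": {
    "tags": {
      "terms": {
        "field": "tag",
        "include": "water_.*"
      }
    }
  }
}
```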
16.8.5. Boolean fields
Aggregations on boolean fields will now return 0 and 1 as keys, and
"true" and "false" as string keys. See Boolean fields for more
information.
16.8.6. Java aggregation classes
The date_histogram aggregation now returns a Histogram object in the
response, and the DateHistogram class has been removed. Similarly the
date_range, ipv4_range, and geo_distance aggregations all return a
Range object in the response, and the IPV4Range, DateRange, and
GeoDistance classes have been removed.
The motivation for this is to have a single response API for the Range and
Histogram aggregations regardless of the type of data being queried. To
support this some changes were made in the MultiBucketAggregation interface
which applies to all bucket aggregations:
- The getKey() method now returns Object instead of String. The actual object type returned depends on the type of aggregation requested (e.g. the date_histogram will return a DateTime object for this method whereas a histogram will return a Number).
- A getKeyAsString() method has been added to return the String representation of the key.
- All other getKeyAsX() methods have been removed.
- The getBucketAsKey(String) methods have been removed on all aggregations except the filters and terms aggregations.
16.9. Parent/Child changes
Parent/child has been rewritten completely to reduce memory usage and to
execute has_child and has_parent queries faster and more efficiently. The
_parent field uses doc values by default. The refactored and improved
implementation is only active for indices created on or after version 2.0.
In order to benefit from all the performance and memory improvements, we
recommend reindexing all existing indices that use the _parent field.
16.9.1. Parent type cannot pre-exist
A mapping type is declared as a child of another mapping type by specifying
the _parent meta field:
PUT my_index
{
"mappings": {
"my_parent": {},
"my_child": {
"_parent": {
"type": "my_parent"
}
}
}
}
The my_parent type is the parent of the my_child type.
The mapping for the parent type can be added at the same time as the mapping for the child type, but cannot be added before the child type.
16.9.2. top_children query removed
The top_children query has been removed in favour of the has_child query.
It wasn’t always faster than the has_child query and the results were usually
inaccurate. The total hits and any aggregations in the same search request
would be incorrect if top_children was used.
16.10. Scripting changes
16.10.1. Scripting syntax
The syntax for scripts has been made consistent across all APIs. The accepted format is as follows:
- Inline/Dynamic scripts
"script": {
  "inline": "doc['foo'].value + val",
  "lang": "groovy",
  "params": { "val": 3 }
}
Here, inline is the script to execute, lang is the optional language of the script, and params holds any named parameters.
- Indexed scripts
"script": {
  "id": "my_script_id",
  "lang": "groovy",
  "params": { "val": 3 }
}
Here, id is the ID of the indexed script.
- File scripts
"script": {
  "file": "my_file",
  "lang": "groovy",
  "params": { "val": 3 }
}
Here, file is the filename of the script, without the .lang suffix.
For example, an update request might look like this:
POST my_index/my_type/1/_update
{
"script": {
"inline": "ctx._source.count += val",
"params": { "val": 3 }
},
"upsert": {
"count": 0
}
}
A short syntax exists for running inline scripts in the default scripting language without any parameters:
GET _search
{
"script_fields": {
"concat_fields": {
"script": "doc['one'].value + ' ' + doc['two'].value"
}
}
}
16.10.2. Scripting settings
The script.disable_dynamic node setting has been replaced by fine-grained
script settings described in Scripting settings.
16.11. Index API changes
16.11.1. Index aliases
Fields used in alias filters no longer have to exist in the mapping at alias creation time. Previously, alias filters were parsed at alias creation time and the parsed form was cached in memory. Now, alias filters are parsed at request time and the fields in filters are resolved from the current mapping.
This also means that index aliases now support has_parent and has_child
queries.
The GET alias api will now throw an exception if no matching aliases are found. This change brings the defaults for this API in line with the other Indices APIs. The Multiple Indices options can be used on a request to change this behavior.
16.11.2. File based index templates
Index templates can no longer be configured on disk. Use the
_template API instead.
16.11.3. Analyze API changes
The Analyze API now returns the position of the first token as 0
instead of 1.
The prefer_local parameter has been removed. The _analyze API is a light
operation and the caller shouldn’t be concerned about whether it executes on
the node that receives the request or another node.
The text() method on AnalyzeRequest now returns String[] instead of
String.
16.11.4. Removed id_cache from clear cache api
The clear cache API no longer supports the id_cache
option. Instead, use the fielddata option to clear the cache for the
_parent field.
16.12. Snapshot and Restore changes
16.12.1. File-system repositories must be whitelisted
Locations of the shared file system repositories and the URL repositories with
file: URLs now have to be registered before starting Elasticsearch using the
path.repo setting. The path.repo setting can contain one or more
repository locations:
path.repo: ["/mnt/daily", "/mnt/weekly"]
If the repository location is specified as an absolute path it has to start
with one of the locations specified in path.repo. If the location is
specified as a relative path, it will be resolved against the first location
specified in the path.repo setting.
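With path.repo configured as above, a shared file system repository can then be registered under one of the whitelisted locations, for instance (a sketch with an illustrative repository name):

```
PUT _snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/daily/my_backup"
  }
}
```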
16.12.2. URL repositories must be whitelisted
URL repositories with http:, https:, and ftp: URLs have to be
whitelisted before starting Elasticsearch with the
repositories.url.allowed_urls setting. This setting supports wildcards in
the place of host, path, query, and fragment. For example:
repositories.url.allowed_urls: ["http://www.example.org/root/*", "https://*.mydomain.com/*?*#*"]
16.12.3. Wildcard expansion
The obsolete parameters expand_wildcards_open and expand_wildcards_close
are no longer supported by the snapshot and restore operations. These
parameters have been replaced by a single expand_wildcards parameter. See
the multi-index docs for more.
16.13. Plugin and packaging changes
16.13.1. Symbolic links and paths
Elasticsearch 2.0 runs with the Java security manager enabled and is much more
restrictive about which paths it is allowed to access. Various paths can be
configured, e.g. path.data, path.scripts, path.repo. A configured path
may itself be a symbolic link, but no symlinks under that path will be
followed.
16.13.2. Running bin/elasticsearch
The command line parameter parsing has been rewritten to deal properly with
spaces in parameters. All config settings can still be specified on the
command line when starting Elasticsearch, but they must appear after the
built-in "static parameters", such as -d (to daemonize) and -p (the PID path).
For instance:
bin/elasticsearch -d -p /tmp/foo.pid --http.cors.enabled=true --http.cors.allow-origin='*'
For a list of static parameters, run bin/elasticsearch -h.
16.13.3. -f removed
The -f parameter, which used to indicate that Elasticsearch should be run in
the foreground, was deprecated in 1.0 and removed in 2.0.
16.13.4. V for version
The -v parameter now means --verbose for both bin/plugin and
bin/elasticsearch (although it has no effect on the latter). To output the
version, use -V or --version instead.
16.13.5. Plugin manager should run as root
The permissions of the config, bin, and plugins directories in the RPM
and deb packages have been made more restrictive. The plugin manager should
be run as root, otherwise it will not be able to install plugins.
16.13.6. Support for official plugins
Almost all of the official Elasticsearch plugins have been moved to the main
elasticsearch repository. They will be released at the same time as
Elasticsearch and have the same version number as Elasticsearch.
Official plugins can be installed as follows:
sudo bin/plugin install analysis-icu
Community-provided plugins can be installed as before.
16.13.7. Plugins require descriptor file
All plugins are now required to have a plugin-descriptor.properties file. If a node has a plugin installed which lacks this file, it will be unable to start.
16.13.8. Repository naming structure changes
Elasticsearch 2.0 changes the way the repository URLs are referenced. Instead of specific repositories for both major and minor versions, the repositories will use a major version reference only.
The URL for apt packages now uses the following structure:
deb http://packages.elastic.co/elasticsearch/2.x/debian stable main
And for yum packages it is:
baseurl=http://packages.elastic.co/elasticsearch/2.x/centos
The repositories page details this change.
16.14. Setting changes
16.14.1. Command line flags
Command line flags using single-dash notation must now be specified as the first arguments. For example, if you previously used:
./elasticsearch --node.name=test_node -Des.path.conf=/opt/elasticsearch/conf/test_node
This will now need to be changed to:
./elasticsearch -Des.path.conf=/opt/elasticsearch/conf/test_node --node.name=test_node
for the flag to take effect.
16.14.2. Scripting settings
The script.disable_dynamic node setting has been replaced by fine-grained
script settings described in the scripting docs.
The following setting previously used to enable dynamic or inline scripts:
script.disable_dynamic: false
It should be replaced with the following two settings in elasticsearch.yml that
achieve the same result:
script.inline: true
script.indexed: true
16.14.3. Units required for time and byte-sized settings
Any settings which accept time or byte values must now be specified with
units. For instance, it is too easy to set the refresh_interval to 1
millisecond instead of 1 second:
PUT _settings
{
"index.refresh_interval": 1
}
In 2.0, the above request will throw an exception. Instead the refresh
interval should be set to "1s" for one second.
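The corrected request specifies the unit explicitly:

```
PUT _settings
{
  "index.refresh_interval": "1s"
}
```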
16.14.4. Merge and merge throttling settings
The tiered merge policy is now the only supported merge policy. These settings have been removed:
- index.merge.policy.type
- index.merge.policy.min_merge_size
- index.merge.policy.max_merge_size
- index.merge.policy.merge_factor
- index.merge.policy.max_merge_docs
- index.merge.policy.calibrate_size_by_deletes
- index.merge.policy.min_merge_docs
Merge throttling now uses a feedback loop to auto-throttle. These settings have been removed:
- indices.store.throttle.type
- indices.store.throttle.max_bytes_per_sec
- index.store.throttle.type
- index.store.throttle.max_bytes_per_sec
16.14.5. Shadow replica settings
The node.enable_custom_paths setting has been removed and replaced by the
path.shared_data setting to allow shadow replicas with custom paths to work
with the security manager. For example, if your previous configuration had:
node.enable_custom_paths: true
And you created an index using shadow replicas with index.data_path set to
/opt/data/my_index with the following:
PUT /my_index
{
"index": {
"number_of_shards": 1,
"number_of_replicas": 4,
"data_path": "/opt/data/my_index",
"shadow_replicas": true
}
}
For 2.0, you will need to set path.shared_data to a parent directory of the
index’s data_path, so:
path.shared_data: /opt/data
16.14.6. Resource watcher settings renamed
The setting names for configuring the resource watcher have been renamed to prevent clashes with the Watcher plugin:
- watcher.enabled is now resource.reload.enabled
- watcher.interval is now resource.reload.interval
- watcher.interval.low is now resource.reload.interval.low
- watcher.interval.medium is now resource.reload.interval.medium
- watcher.interval.high is now resource.reload.interval.high
16.14.8. Hunspell dictionary configuration
The parameter indices.analysis.hunspell.dictionary.location has been
removed, and <path.conf>/hunspell is always used.
16.14.9. CORS allowed origins
The CORS allowed origins setting, http.cors.allow-origin, no longer has a default value. Previously, the default value
was *, which would allow CORS requests from any origin and is considered insecure. The http.cors.allow-origin setting
should be specified with only the origins that should be allowed, like so:
http.cors.allow-origin: /https?:\/\/localhost(:[0-9]+)?/
16.14.10. JSONP support
JSONP callback support has now been removed. CORS should be used to access Elasticsearch over AJAX instead:
http.cors.enabled: true
http.cors.allow-origin: /https?:\/\/localhost(:[0-9]+)?/
16.14.11. In memory indices
The memory / ram store (index.store.type) option was removed in
Elasticsearch. In-memory indices are no longer supported.
16.14.12. Log messages truncated
Log messages are now truncated at 10,000 characters. This can be changed in
the logging.yml configuration file with the file.layout.conversionPattern
setting.
16.14.13. Custom config file
It is no longer possible to specify a custom config file with the CONF_FILE
environment variable, or the -Des.config, -Des.default.config, or
-Delasticsearch.config parameters.
Instead, the config file must be named elasticsearch.yml and must be located
in the default config/ directory, unless a custom config directory is specified.
The location of a custom config directory may be specified as follows:
./bin/elasticsearch --path.conf=/path/to/conf/dir
./bin/plugin -Des.path.conf=/path/to/conf/dir install analysis-icu
When using the RPM or debian packages, the plugin script and the
init/service scripts will consult the CONF_DIR environment variable
to check for a custom config location. The value of the CONF_DIR
variable can be set in the environment config file which is located either in
/etc/default/elasticsearch or /etc/sysconfig/elasticsearch.
16.15. Stats, info, and cat changes
16.15.1. Sigar removed
We no longer ship the Sigar library for operating system dependent statistics, as it no longer seems to be maintained. Instead, we rely on the statistics provided by the JVM. This has resulted in a number of changes to the node info, and node stats responses:
- network.* has been removed from nodes info and nodes stats.
- fs.*.dev and fs.*.disk* have been removed from nodes stats.
- os.* has been removed from nodes stats, except for os.timestamp, os.load_average, os.mem.*, and os.swap.*.
- os.mem.total and os.swap.total have been removed from nodes info.
- process.mem.resident and process.mem.share have been removed from node stats.
16.15.2. Removed id_cache from stats apis
Removed the id_cache metric from the nodes stats, indices stats and cluster stats
apis. This metric has also been removed from the shards cat, indices cat and
nodes cat apis. Parent/child memory is now reported under fielddata, because
it has internally been using fielddata for a while now.
To see how much memory the parent/child field data is using, the
fielddata_fields option can be used on the stats apis. Indices stats
example:
GET /_stats/fielddata?fielddata_fields=_parent
16.15.3. Percolator stats
The total time spent running percolator queries is now called percolate.time
instead of percolate.get_time.
16.16. Java API changes
16.16.1. Transport API construction
The TransportClient construction code has changed; it now uses the builder
pattern. Instead of:
Settings settings = Settings.settingsBuilder()
.put("cluster.name", "myClusterName").build();
Client client = new TransportClient(settings);
Use the following:
Settings settings = Settings.settingsBuilder()
.put("cluster.name", "myClusterName").build();
Client client = TransportClient.builder().settings(settings).build();
The transport client also no longer supports loading settings from config files. If you have a config file, you can load it into settings yourself before constructing the transport client:
Settings settings = Settings.settingsBuilder()
.loadFromPath(pathToYourSettingsFile).build();
Client client = TransportClient.builder().settings(settings).build();
16.16.2. Exceptions are only thrown on total failure
Previously, many APIs would throw an exception if any shard failed to execute
the request. Now the exception is only thrown if all shards fail the request.
The responses for these APIs will always have a getShardFailures method that
you can and should check for failures.
16.16.4. Automatically thread client listeners
Previously, the user had to set request listener threads to true when on the
client side in order not to block IO threads on heavy operations. This proved
to be very trappy for users, and ended up creating problems that are very hard
to debug.
In 2.0, Elasticsearch automatically threads listeners that are used from the client when the client is a node client or a transport client. Threading can no longer be manually set.
16.16.5. Query/filter refactoring
org.elasticsearch.index.queries.FilterBuilders has been removed as part of the merge of
queries and filters. These filters are now available in QueryBuilders with the same name.
All methods that used to accept a FilterBuilder now accept a QueryBuilder instead.
In addition, some query builders have been removed or renamed:
- commonTerms(...) renamed to commonTermsQuery(...)
- queryString(...) renamed to queryStringQuery(...)
- simpleQueryString(...) renamed to simpleQueryStringQuery(...)
- textPhrase(...) removed
- textPhrasePrefix(...) removed
- textPhrasePrefixQuery(...) removed
- filtered(...) removed. Use filteredQuery(...) instead.
- inQuery(...) removed.
16.16.6. GetIndexRequest
GetIndexRequest.features() now returns an array of Feature Enums instead of an array of String values.
The following deprecated methods have been removed:
- GetIndexRequest.addFeatures(String[]) - use GetIndexRequest.addFeatures(Feature[]) instead.
- GetIndexRequest.features(String[]) - use GetIndexRequest.features(Feature[]) instead.
- GetIndexRequestBuilder.addFeatures(String[]) - use GetIndexRequestBuilder.addFeatures(Feature[]) instead.
- GetIndexRequestBuilder.setFeatures(String[]) - use GetIndexRequestBuilder.setFeatures(Feature[]) instead.
16.16.7. BytesQueryBuilder removed
The redundant BytesQueryBuilder has been removed in favour of the WrapperQueryBuilder internally.
16.16.8. TermsQueryBuilder execution removed
The TermsQueryBuilder#execution method has been removed as it has no effect; it is ignored by the
corresponding parser.
16.16.10. InetSocketTransportAddress removed
Use InetSocketTransportAddress(InetSocketAddress address) instead of InetSocketTransportAddress(String, int).
You can create an InetSocketAddress instance with InetSocketAddress(String, int). For example:
new InetSocketTransportAddress(new InetSocketAddress("127.0.0.1", 0));
16.16.11. Request Builders refactoring
An action parameter has been added to various request builders:
- Instead of new SnapshotsStatusRequestBuilder(elasticSearchClient), use new SnapshotsStatusRequestBuilder(elasticSearchClient, SnapshotsStatusAction.INSTANCE).
- Instead of new CreateSnapshotRequestBuilder(elasticSearchClient), use new CreateSnapshotRequestBuilder(elasticSearchClient, CreateSnapshotAction.INSTANCE).
- Instead of new CreateIndexRequestBuilder(elasticSearchClient, index), use new CreateIndexRequestBuilder(elasticSearchClient, CreateIndexAction.INSTANCE, index).
16.16.12. Shading and package relocation removed
Elasticsearch used to shade its dependencies and to relocate packages. We no longer use shading or relocation. You might need to change your imports to the original package names:
- com.google.common was org.elasticsearch.common
- com.carrotsearch.hppc was org.elasticsearch.common.hppc
- jsr166e was org.elasticsearch.common.util.concurrent.jsr166e
- com.fasterxml.jackson was org.elasticsearch.common.jackson
- org.joda.time was org.elasticsearch.common.joda.time
- org.joda.convert was org.elasticsearch.common.joda.convert
- org.jboss.netty was org.elasticsearch.common.netty
- com.ning.compress was org.elasticsearch.common.compress
- com.github.mustachejava was org.elasticsearch.common.mustache
- com.tdunning.math.stats was org.elasticsearch.common.stats
- org.apache.commons.lang was org.elasticsearch.common.lang
- org.apache.commons.cli was org.elasticsearch.common.cli.commons
17. Breaking changes in 1.6
This section discusses the changes that you need to be aware of when migrating your application from Elasticsearch 1.x to Elasticsearch 1.6.
More Like This API
The More Like This API query has been deprecated and will be removed in 2.0. Instead use the More Like This Query.
top_children query
The top_children query has been deprecated and will be removed in 2.0. Instead the has_child query should be used.
The top_children query isn’t always faster than the has_child query and the top_children query is often inaccurate.
The total hits and any aggregations in the same search request will likely be off.
18. Breaking changes in 1.4
This section discusses the changes that you need to be aware of when migrating your application from Elasticsearch 1.x to Elasticsearch 1.4.
Percolator
In indices created with version 1.4.0 or later, percolation queries can only
refer to fields that already exist in the mappings in that index. There are
two ways to make sure that a field mapping exists:
- Add or update a mapping via the create index or put mapping apis.
- Percolate a document before registering a query. Percolating a document can add field mappings dynamically, in the same way as happens when indexing a document.
Aliases
Aliases can include filters which
are automatically applied to any search performed via the alias.
Filtered aliases created with version 1.4.0 or later can only
refer to field names which exist in the mappings of the index (or indices)
pointed to by the alias.
Add or update a mapping via the create index or put mapping apis.
Indices APIs
The get warmer api will return a section for warmers even if there are
no warmers. This ensures that the following two examples are equivalent:
curl -XGET 'http://localhost:9200/_all/_warmers'
curl -XGET 'http://localhost:9200/_warmers'
The get alias api will return a section for aliases even if there are
no aliases. This ensures that the following two examples are equivalent:
curl -XGET 'http://localhost:9200/_all/_aliases'
curl -XGET 'http://localhost:9200/_aliases'
The get mapping api will return a section for mappings even if there are
no mappings. This ensures that the following two examples are equivalent:
curl -XGET 'http://localhost:9200/_all/_mappings'
curl -XGET 'http://localhost:9200/_mappings'
Bulk UDP
Bulk UDP has been deprecated and will be removed in 2.0. You should use the standard bulk API instead.
Zen discovery
Each cluster must have an elected master node in order to be fully operational. Once a node loses its elected master node, it will reject some or all operations. On versions before 1.4.0.Beta1, all operations are rejected when a node loses its elected master. From 1.4.0.Beta1,
only write operations will be rejected by default. Read operations will still be served based on the information available
to the node, which may result in responses that are partial and possibly stale. If the default is undesired, the
pre-1.4.0.Beta1 behaviour can be enabled; see: no-master-block
More Like This Field
The More Like This Field query has been deprecated in favor of the More Like This Query
restricted to a specific field. It will be removed in 2.0.
MVEL is deprecated
Groovy is the new default scripting language in Elasticsearch, and is enabled in sandbox mode
by default. MVEL has been removed from core, but is available as a plugin:
https://github.com/elasticsearch/elasticsearch-lang-mvel
19. Breaking changes in 1.0
This section discusses the changes that you need to be aware of when migrating your application to Elasticsearch 1.0.
19.1. System and settings
- Elasticsearch now runs in the foreground by default. There is no more -f flag on the command line. Instead, to run Elasticsearch as a daemon, use the -d flag:
./bin/elasticsearch -d
- Command line settings can now be passed without the -Des. prefix, for instance:
./bin/elasticsearch --node.name=search_1 --cluster.name=production
- Elasticsearch on 64-bit Linux now uses mmapfs by default. Make sure that you set MAX_MAP_COUNT to a sufficiently high number. The RPM and Debian packages default this value to 262144.
- The RPM and Debian packages no longer start Elasticsearch by default.
- The cluster.routing.allocation settings (disable_allocation, disable_new_allocation and disable_replica_allocation) have been replaced by the single setting:
cluster.routing.allocation.enable: all|primaries|new_primaries|none
19.2. Stats and Info APIs
The cluster_state, nodes_info,
nodes_stats and indices_stats
APIs have all been changed to make their format more RESTful and less clumsy.
For instance, if you just want the nodes section of the cluster_state,
instead of:
GET /_cluster/state?filter_metadata&filter_routing_table&filter_blocks
you now use:
GET /_cluster/state/nodes
Similarly for the nodes_stats API, if you want the transport and http
metrics only, instead of:
GET /_nodes/stats?clear&transport&http
you now use:
GET /_nodes/stats/transport,http
See the links above for full details.
19.3. Indices APIs
The mapping, alias, settings, and warmer index APIs are all similar
but there are subtle differences in the order of the URL and the response
body. For instance, adding a mapping and a warmer look slightly different:
PUT /{index}/{type}/_mapping
PUT /{index}/_warmer/{name}
These URLs have been unified as:
PUT /{indices}/_mapping/{type}
PUT /{indices}/_alias/{name}
PUT /{indices}/_warmer/{name}
GET /{indices}/_mapping/{types}
GET /{indices}/_alias/{names}
GET /{indices}/_settings/{names}
GET /{indices}/_warmer/{names}
DELETE /{indices}/_mapping/{types}
DELETE /{indices}/_alias/{names}
DELETE /{indices}/_warmer/{names}
All of the {indices}, {types} and {names} parameters can be replaced by:
- _all, * or blank (ie left out altogether), all of which mean "all"
- wildcards like test*
- comma-separated lists: index_1,test_*
The only exception is DELETE which doesn’t accept blank (missing)
parameters. If you want to delete something, you should be specific.
Similarly, the return values for GET have been unified with the following
rules:
- Only return values that exist. If you try to GET a mapping which doesn’t exist, then the result will be an empty object: {}. We no longer throw a 404 if the requested mapping/warmer/alias/setting doesn’t exist.
- The response format always has the index name, then the section, then the element name, for instance:
{ "my_index": { "mappings": { "my_type": {...} } } }
This is a breaking change for the get_mapping API.
In the future we will also provide plural versions to allow putting multiple mappings etc in a single request.
See put-mapping, get-
mapping, get-field-mapping,
update-settings, get-settings,
warmers, and aliases for more details.
19.4. Index request
Previously a document could be indexed as itself, or wrapped in an outer
object which specified the type name:
PUT /my_index/my_type/1
{
"my_type": {
... doc fields ...
}
}
This led to some ambiguity when a document also included a field with the same
name as the type. We no longer accept the outer type wrapper, but this
behaviour can be reenabled on an index-by-index basis with the setting:
index.mapping.allow_type_wrapper.
19.5. Search requests
While the search API takes a top-level query parameter, the
count, delete-by-query and
validate-query requests expected the whole body to be a
query. These now require a top-level query parameter:
GET /_count
{
"query": {
"match": {
"title": "Interesting stuff"
}
}
}
Also, the top-level filter parameter in search has been renamed to
post_filter, to indicate that it should not
be used as the primary way to filter search results (use a
bool query instead), but only to filter
results AFTER aggregations have been calculated.
This example counts the top colors in all matching docs, but only returns docs
with color red:
GET /_search
{
"query": {
"match_all": {}
},
"aggs": {
"colors": {
"terms": { "field": "color" }
}
},
"post_filter": {
"term": {
"color": "red"
}
}
}
19.6. Multi-fields
Multi-fields are dead! Long live multi-fields! Well, the field type
multi_field has been removed. Instead, any of the core field types
(excluding object and nested) now accept a fields parameter. It’s the
same thing, but nicer. Instead of:
"title": {
"type": "multi_field",
"fields": {
"title": { "type": "string" },
"raw": { "type": "string", "index": "not_analyzed" }
}
}
you can now write:
"title": {
"type": "string",
"fields": {
"raw": { "type": "string", "index": "not_analyzed" }
}
}
Existing multi-fields will be upgraded to the new format automatically.
Also, instead of having to use the arcane path and index_name parameters
in order to index multiple fields into a single “custom _all field”, you
can now use the copy_to parameter.
19.8. Dates without years
When dates are specified without a year, for example: Dec 15 10:00:00 they
are treated as dates in 2000 during indexing and range searches… except for
the upper included bound lte where they were treated as dates in 1970! Now,
all dates without years
use 1970 as the default.
19.9. Parameters
- Geo queries used to use miles as the default unit. And we all know what happened at NASA because of that decision. The new default unit is meters.
- For all queries that support fuzziness, the min_similarity, fuzziness and edit_distance parameters have been unified as the single parameter fuzziness. See Fuzziness for details of accepted values.
- The ignore_missing parameter has been replaced by the expand_wildcards, ignore_unavailable and allow_no_indices parameters, all of which have sensible defaults. See the multi-index docs for more.
- An index name (or pattern) is now required for destructive operations like deleting indices:
# v0.90 - delete all indices:
DELETE /
# v1.0 - delete all indices:
DELETE /_all
DELETE /*
Setting action.destructive_requires_name to true provides further safety by disabling wildcard expansion on destructive actions.
19.10. Return values
- The ok return value has been removed from all response bodies as it added no useful information.
- The found, not_found and exists return values have been unified as found on all relevant APIs.
- Field values, in response to the fields parameter, are now always returned as arrays. A field could have single or multiple values, which meant that sometimes they were returned as scalars and sometimes as arrays. By always returning arrays, this simplifies user code. The only exception to this rule is when fields is used to retrieve metadata like the routing value, which are always singular. Metadata fields are always returned as scalars.
The fields parameter is intended to be used for retrieving stored fields, rather than for fields extracted from the _source. That means that it can no longer be used to return whole objects and it no longer accepts the _source.fieldname format. For these you should use the _source, _source_include and _source_exclude parameters instead.
- Settings, like index.analysis.analyzer.default, are now returned as proper nested JSON objects, which makes them easier to work with programmatically:
{ "index": { "analysis": { "analyzer": { "default": xxx } } } }
You can choose to return them in flattened format by passing ?flat_settings in the query string.
- The analyze API no longer supports the text response format, but does support JSON and YAML.
19.11. Deprecations
- The text query has been removed. Use the match query instead.
- The field query has been removed. Use the query_string query instead.
- Per-document boosting with the _boost field has been removed. You can use the function_score instead.
- The path parameter in mappings has been deprecated. Use the copy_to parameter instead.
- The custom_score and custom_boost_score queries are no longer supported. You can use function_score instead.
19.12. Percolator
The percolator has been redesigned and because of this the dedicated _percolator index is no longer used by the percolator,
but instead the percolator works with a dedicated .percolator type. Read the redesigned percolator
blog post for the reasons why the percolator has been redesigned.
Elasticsearch will not delete the _percolator index when upgrading, but the percolate api will no longer use the queries
stored in the _percolator index. In order to use the already stored queries, you can just re-index the queries from the
_percolator index into any index under the reserved .percolator type. The format in which the percolate queries
were stored has not been changed. So a simple script that does a scan search to retrieve all the percolator queries
and then does a bulk request into another index should be sufficient.
API Conventions
The elasticsearch REST APIs are exposed using JSON over HTTP.
The conventions listed in this chapter can be applied throughout the REST API, unless otherwise specified.
20. Multiple Indices
Most APIs that refer to an index parameter support execution across multiple indices,
using simple test1,test2,test3 notation (or _all for all indices). They also
support wildcards, for example test*, and the ability to "add" (+)
and "remove" (-), for example: +test*,-test3.
All multi-index APIs support the following URL query string parameters:
ignore_unavailable
Controls whether to ignore any specified indices that are unavailable; this includes indices that don’t exist and closed indices. Either true or false can be specified.
allow_no_indices
Controls whether to fail if a wildcard indices expression results in no concrete indices. Either true or false can be specified. For example, if the wildcard expression foo* is specified and no indices are available that start with foo, then depending on this setting the request will fail. This setting is also applicable when _all, * or no index has been specified. This setting also applies to aliases, in case an alias points to a closed index.
expand_wildcards
Controls what kind of concrete indices wildcard indices expressions expand to. If open is specified then the wildcard expression is expanded to only open indices, and if closed is specified then the wildcard expression is expanded only to closed indices. Both values (open,closed) can be specified to expand to all indices.
If none is specified then wildcard expansion will be disabled and if all
is specified, wildcard expressions will expand to all indices (this is equivalent
to specifying open,closed).
The default settings for the above parameters depend on the API being used.
Note: Single index APIs such as the Document APIs and the
single-index alias APIs do not support multiple indices.
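The add/remove notation described above can be modeled with a small Python sketch. The function name and the use of shell-style glob matching are assumptions of this sketch, not Elasticsearch internals:

```python
from fnmatch import fnmatch

def expand_indices(expression, available):
    """Expand a multi-index expression such as '+test*,-test3' against
    the set of available index names. '_all' and blank mean every index."""
    if expression in ("_all", ""):
        return sorted(available)
    selected = set()
    for part in expression.split(","):
        if part.startswith("-"):          # "remove": subtract matching indices
            selected -= {i for i in available if fnmatch(i, part[1:])}
        else:                             # bare or "+"-prefixed: "add"
            pattern = part[1:] if part.startswith("+") else part
            selected |= {i for i in available if fnmatch(i, pattern)}
    return sorted(selected)

indices = {"test1", "test2", "test3", "logs"}
print(expand_indices("+test*,-test3", indices))  # ['test1', 'test2']
```

Parts are applied left to right, so a later `-` can subtract indices that an earlier wildcard added.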
21. Date math support in index names
Date math index name resolution enables you to search a range of time-series indices, rather than searching all of your time-series indices and filtering the results or maintaining aliases. Limiting the number of indices that are searched reduces the load on the cluster and improves execution performance. For example, if you are searching for errors in your daily logs, you can use a date math name template to restrict the search to the past two days.
Almost all APIs that have an index parameter support date math in the index parameter
value.
A date math index name takes the following form:
<static_name{date_math_expr{date_format|time_zone}}>
Where:
static_name
is the static text part of the name
date_math_expr
is a dynamic date math expression that computes the date dynamically
date_format
is the optional format in which the computed date should be rendered. Defaults to YYYY.MM.dd.
time_zone
is the optional time zone. Defaults to utc.
You must enclose date math index name expressions within angle brackets. For example:
curl -XGET 'localhost:9200/<logstash-{now%2Fd-2d}>/_search' -d '{
  "query" : {
    ...
  }
}'
Note: The / used for date rounding must be URL-encoded as %2F in any URL.
The following example shows different forms of date math index names and the final index names they resolve to given that the current time is 22nd March 2024, noon UTC.
| Expression | Resolves to |
|---|---|
| <logstash-{now/d}> | logstash-2024.03.22 |
| <logstash-{now/M}> | logstash-2024.03.01 |
| <logstash-{now/M{YYYY.MM}}> | logstash-2024.03 |
| <logstash-{now/M-1M{YYYY.MM}}> | logstash-2024.02 |
| <logstash-{now/d{YYYY.MM.dd\|+12:00}}> | logstash-2024.03.23 |
To use the characters { and } in the static part of an index name template, escape them
with a backslash \, for example:
- <elastic\\{ON\\}-{now/M}> resolves to elastic{ON}-2024.03.01
The following example shows a search request that searches the Logstash indices for the past
three days, assuming the indices use the default Logstash index name format,
logstash-YYYY.MM.dd.
curl -XGET 'localhost:9200/<logstash-{now%2Fd-2d}>,<logstash-{now%2Fd-1d}>,<logstash-{now%2Fd}>/_search' -d '{
  "query" : {
    ...
  }
}'
22. Common options
The following options can be applied to all of the REST APIs.
Pretty Results
When appending ?pretty=true to any request made, the JSON returned
will be pretty formatted (use it for debugging only!). Another option is
to set ?format=yaml which will cause the result to be returned in the
(sometimes) more readable yaml format.
Human readable output
Statistics are returned in a format suitable for humans
(eg "exists_time": "1h" or "size": "1kb") and for computers
(eg "exists_time_in_millis": 3600000 or "size_in_bytes": 1024).
The human readable values can be turned off by adding ?human=false
to the query string. This makes sense when the stats results are
being consumed by a monitoring tool, rather than intended for human
consumption. The default for the human flag is
false.
Date Math
Most parameters which accept a formatted date value — such as gt and lt
in range queries, or from and to in daterange
aggregations — understand date maths.
The expression starts with an anchor date, which can either be now, or a
date string ending with ||. This anchor date can optionally be followed by
one or more maths expressions:
- +1h - add one hour
- -1d - subtract one day
- /d - round down to the nearest day
The supported time units are: y (year), M (month), w (week),
d (day), h (hour), m (minute), and s (second).
Some examples are:
now+1h
The current time plus one hour, with ms resolution.
now+1h+1m
The current time plus one hour plus one minute, with ms resolution.
now+1h/d
The current time plus one hour, rounded down to the nearest day.
2015-01-01||+1M/d
2015-01-01 plus one month, rounded down to the nearest day.
Response Filtering
All REST APIs accept a filter_path parameter that can be used to reduce
the response returned by elasticsearch. This parameter takes a comma
separated list of filters expressed with the dot notation:
curl -XGET 'localhost:9200/_search?pretty&filter_path=took,hits.hits._id,hits.hits._score'
{
"took" : 3,
"hits" : {
"hits" : [
{
"_id" : "3640",
"_score" : 1.0
},
{
"_id" : "3642",
"_score" : 1.0
}
]
}
}
It also supports the * wildcard character to match any field or part
of a field’s name:
curl -XGET 'localhost:9200/_nodes/stats?filter_path=nodes.*.ho*'
{
"nodes" : {
"lvJHed8uQQu4brS-SXKsNA" : {
"host" : "portable"
}
}
}
And the ** wildcard can be used to include fields without knowing the
exact path of the field. For example, we can return the Lucene version
of every segment with this request:
curl 'localhost:9200/_segments?pretty&filter_path=indices.**.version'
{
"indices" : {
"movies" : {
"shards" : {
"0" : [ {
"segments" : {
"_0" : {
"version" : "5.2.0"
}
}
} ],
"2" : [ {
"segments" : {
"_0" : {
"version" : "5.2.0"
}
}
} ]
}
},
"books" : {
"shards" : {
"0" : [ {
"segments" : {
"_0" : {
"version" : "5.2.0"
}
}
} ]
}
}
}
}
Note that elasticsearch sometimes returns the raw value of a field directly,
like the _source field. If you want to filter _source fields, you should
consider combining the already existing _source parameter (see
Get API for more details) with the filter_path
parameter like this:
curl -XGET 'localhost:9200/_search?pretty&filter_path=hits.hits._source&_source=title'
{
"hits" : {
"hits" : [ {
"_source":{"title":"Book #2"}
}, {
"_source":{"title":"Book #1"}
}, {
"_source":{"title":"Book #3"}
} ]
}
}
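The dot-notation filtering above can be approximated with a short sketch. It handles single `*` wildcards only; the `**` deep wildcard and the raw `_source` edge cases are omitted, and the function name is an assumption of this sketch:

```python
from fnmatch import fnmatch

def filter_path(obj, paths):
    """Keep only the parts of a JSON-like dict selected by
    dot-notation filters with '*' wildcards."""
    if not isinstance(obj, dict):
        return obj
    out = {}
    for key, value in obj.items():
        for path in paths:
            head, _, tail = path.partition(".")
            if not fnmatch(key, head):
                continue
            if not tail:                   # filter fully consumed: keep value
                out[key] = value
            elif isinstance(value, list):  # descend into each array element
                kept = [filter_path(v, [tail]) for v in value]
                kept = [v for v in kept if v not in ({}, None)]
                if kept:
                    out[key] = kept
            else:
                nested = filter_path(value, [tail])
                if nested != {}:
                    out[key] = nested
    return out

response = {"took": 3, "timed_out": False,
            "hits": {"total": 2,
                     "hits": [{"_id": "3640", "_score": 1.0,
                               "_source": {"title": "x"}}]}}
print(filter_path(response, ["took", "hits.hits._id"]))
# {'took': 3, 'hits': {'hits': [{'_id': '3640'}]}}
```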
Flat Settings
The flat_settings flag affects rendering of the lists of settings. When the
flat_settings flag is true, settings are returned in a flat format:
{
"persistent" : { },
"transient" : {
"discovery.zen.minimum_master_nodes" : "1"
}
}
When the flat_settings flag is false, settings are returned in a more
human readable structured format:
{
"persistent" : { },
"transient" : {
"discovery" : {
"zen" : {
"minimum_master_nodes" : "1"
}
}
}
}
By default, flat_settings is set to false.
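The relationship between the two formats can be sketched as a pair of helper functions (the names are illustrative, not Elasticsearch internals):

```python
def flatten(settings, prefix=""):
    """Render nested settings in the flat_settings=true style."""
    flat = {}
    for key, value in settings.items():
        dotted = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted + "."))
        else:
            flat[dotted] = value
    return flat

def unflatten(flat):
    """Expand dotted keys back into the structured (default) format."""
    nested = {}
    for dotted, value in flat.items():
        node = nested
        *parents, leaf = dotted.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested

transient = {"discovery": {"zen": {"minimum_master_nodes": "1"}}}
print(flatten(transient))  # {'discovery.zen.minimum_master_nodes': '1'}
print(unflatten(flatten(transient)) == transient)  # True
```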
Parameters
Rest parameters (when using HTTP, they map to HTTP URL parameters) follow the convention of using underscore casing.
Boolean Values
All REST API parameters (both request parameters and JSON body) support
providing boolean "false" as the values: false, 0, no and off.
All other values are considered "true". Note, this is not related to
boolean fields within an indexed document.
Number Values
All REST APIs support providing number parameters as strings in addition
to supporting the native JSON number types.
Time units
Whenever durations need to be specified, eg for a timeout parameter, the
duration must specify the unit, like 2d for 2 days. The supported units
are:
y - Year
M - Month
w - Week
d - Day
h - Hour
m - Minute
s - Second
ms - Millisecond
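Parsing such duration strings can be sketched as follows. The function name is illustrative, and `y` and `M` are omitted because they are calendar-dependent:

```python
from datetime import timedelta

# Millisecond multipliers for the fixed-length units. 'y' (year) and
# 'M' (month) are calendar-dependent, so this sketch omits them.
UNIT_MILLIS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000,
               "d": 86_400_000, "w": 604_800_000}

def parse_duration(text):
    """Parse a duration string such as '2d' or '500ms' into a timedelta."""
    for unit in sorted(UNIT_MILLIS, key=len, reverse=True):  # try 'ms' before 'm'
        if text.endswith(unit):
            return timedelta(milliseconds=int(text[:-len(unit)]) * UNIT_MILLIS[unit])
    raise ValueError(f"missing or unknown time unit in {text!r}")

print(parse_duration("2d"))   # 2 days, 0:00:00
print(parse_duration("90m"))  # 1:30:00
```

Checking the two-character `ms` suffix before the single-character units avoids misreading `500ms` as 500 minutes.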
Data size units
Whenever the size of data needs to be specified, eg when setting a buffer size
parameter, the value must specify the unit, like 10kb for 10 kilobytes. The
supported units are:
b - Bytes
kb - Kilobytes
mb - Megabytes
gb - Gigabytes
tb - Terabytes
pb - Petabytes
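A small sketch of parsing these size strings, assuming binary (1024-based) multipliers; the function name is illustrative:

```python
SIZE_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3,
              "tb": 1024**4, "pb": 1024**5}

def parse_bytes(text):
    """Parse a data size such as '10kb' into a number of bytes,
    assuming 1kb = 1024 bytes."""
    text = text.lower()
    for unit in sorted(SIZE_UNITS, key=len, reverse=True):  # match 'kb' before 'b'
        if text.endswith(unit):
            return int(text[:-len(unit)]) * SIZE_UNITS[unit]
    raise ValueError(f"missing or unknown size unit in {text!r}")

print(parse_bytes("10kb"))  # 10240
print(parse_bytes("1gb"))   # 1073741824
```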
Distance Units
Wherever distances need to be specified, such as the distance parameter in
the Geo Distance Query, the default unit if none is specified is
the meter. Distances can be specified in other units, such as "1km" or
"2mi" (2 miles).
The full list of units is listed below:
Mile - mi or miles
Yard - yd or yards
Feet - ft or feet
Inch - in or inch
Kilometer - km or kilometers
Meter - m or meters
Centimeter - cm or centimeters
Millimeter - mm or millimeters
Nautical mile - NM, nmi or nauticalmiles
The precision parameter in the Geohash Cell Query accepts
distances with the above units, but if no unit is specified, then the
precision is interpreted as the length of the geohash.
Fuzziness
Some queries and APIs support parameters to allow inexact fuzzy matching,
using the fuzziness parameter. The fuzziness parameter is context
sensitive which means that it depends on the type of the field being queried:
Numeric, date and IPv4 fields
When querying numeric, date and IPv4 fields, fuzziness is interpreted as a
+/- margin. It behaves like a Range Query where:
-fuzziness <= field value <= +fuzziness
The fuzziness parameter should be set to a numeric value, eg 2 or 2.0. A
date field interprets a long as milliseconds, but also accepts a string
containing a time value — "1h" — as explained in Time units. An ip
field accepts a long or another IPv4 address (which will be converted into a
long).
String fields
When querying string fields, fuzziness is interpreted as a
Levenshtein Edit Distance — the number of one character changes that need to be made to one string to
make it the same as another string.
The fuzziness parameter can be specified as:
0, 1, 2
the maximum allowed Levenshtein Edit Distance (or number of edits)
AUTO
generates an edit distance based on the length of the term. For lengths:
0..2 - must match exactly
3..5 - one edit allowed
>5 - two edits allowed
AUTO should generally be the preferred value for fuzziness.
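The AUTO length brackets and the numeric +/- margin can be expressed directly. The function names are illustrative, not Elasticsearch internals:

```python
def auto_fuzziness(term):
    """Edit distance chosen by fuzziness=AUTO based on term length:
    0..2 characters must match exactly, 3..5 allow one edit,
    longer terms allow two edits."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2

def numeric_fuzzy_range(value, fuzziness):
    """For numeric fields, fuzziness is a +/- margin around the value,
    behaving like a range query."""
    return (value - fuzziness, value + fuzziness)

print(auto_fuzziness("ox"))        # 0
print(auto_fuzziness("quick"))     # 1
print(auto_fuzziness("elastic"))   # 2
print(numeric_fuzzy_range(10, 2))  # (8, 12)
```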
Result Casing
All REST APIs accept the case parameter. When set to camelCase, all
field names in the result will be returned in camel casing, otherwise,
underscore casing will be used. Note, this does not apply to the source
document indexed.
Request body in query string
For libraries that don’t accept a request body for non-POST requests,
you can pass the request body as the source query string parameter
instead.
23. URL-based access control
Many users use a proxy with URL-based access control to secure access to Elasticsearch indices. For multi-search, multi-get and bulk requests, the user has the choice of specifying an index in the URL and on each individual request within the request body. This can make URL-based access control challenging.
To prevent the user from overriding the index which has been specified in the
URL, add this setting to the config.yml file:
rest.action.multi.allow_explicit_index: false
The default value is true, but when set to false, Elasticsearch will
reject requests that have an explicit index specified in the request body.
Document APIs
This section describes the following CRUD APIs:
- Index API
- Get API
- Delete API
- Update API
- Update By Query API
All CRUD APIs are single-index APIs. The index parameter accepts a single
index name, or an alias which points to a single index.
24. Index API
The index API adds or updates a typed JSON document in a specific index, making it searchable. The following example inserts the JSON document into the "twitter" index, under a type called "tweet" with an id of 1:
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}'
The result of the above index operation is:
{
"_shards" : {
"total" : 10,
"failed" : 0,
"successful" : 10
},
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_version" : 1,
"created" : true
}
The _shards header provides information about the replication process of the index operation.
- total - Indicates how many shard copies (primary and replica shards) the index operation should be executed on.
- successful - Indicates the number of shard copies the index operation succeeded on.
- failures - An array that contains replication-related errors in the case an index operation failed on a replica shard.
The index operation is successful if successful is at least 1.
Replica shards may not all be started when an indexing operation successfully returns (by default, a quorum is required). In that case, total will be equal to the total shards based on the index replica settings and successful will be equal to the number of shards started (primary plus replicas). As there were no failures, failed will be 0.
Automatic Index Creation
The index operation automatically creates an index if it has not been created before (check out the create index API for manually creating an index), and also automatically creates a dynamic type mapping for the specific type if one has not yet been created (check out the put mapping API for manually creating a type mapping).
The mapping itself is very flexible and is schema-free. New fields and objects will automatically be added to the mapping definition of the type specified. Check out the mapping section for more information on mapping definitions.
Automatic index creation can be disabled by setting
action.auto_create_index to false in the config file of all nodes.
Automatic mapping creation can be disabled by setting
index.mapper.dynamic to false in the config files of all nodes (or
on the specific index settings).
Automatic index creation can include a pattern based white/black list,
for example, set action.auto_create_index to +aaa*,-bbb*,+ccc*,-* (+
meaning allowed, and - meaning disallowed).
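The pattern-list semantics can be sketched with shell-style globs, assuming the first matching pattern wins (an illustrative re-implementation, not the actual Elasticsearch code):

```python
# Illustrative sketch of evaluating an action.auto_create_index
# pattern list such as "+aaa*,-bbb*,+ccc*,-*":
# "+" allows, "-" disallows, first matching pattern wins.
from fnmatch import fnmatch

def index_creation_allowed(index: str, patterns: str) -> bool:
    for pattern in patterns.split(","):
        allowed = not pattern.startswith("-")
        glob = pattern.lstrip("+-")
        if fnmatch(index, glob):
            return allowed
    return False  # assumption for this sketch: no match means disallowed

rules = "+aaa*,-bbb*,+ccc*,-*"
print(index_creation_allowed("aaa-logs", rules))  # True
print(index_creation_allowed("bbb-logs", rules))  # False
print(index_creation_allowed("other", rules))     # False
```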
Versioning
Each indexed document is given a version number. The associated
version number is returned as part of the response to the index API
request. The index API optionally allows for
optimistic
concurrency control when the version parameter is specified. This
will control the version of the document the operation is intended to be
executed against. A good example of a use case for versioning is
performing a transactional read-then-update. Specifying a version from
the document initially read ensures no changes have happened in the
meantime (when reading in order to update, it is recommended to set
preference to _primary). For example:
curl -XPUT 'localhost:9200/twitter/tweet/1?version=2' -d '{
"message" : "elasticsearch now has versioning support, double cool!"
}'
NOTE: versioning is completely real time, and is not affected by the near real time aspects of search operations. If no version is provided, then the operation is executed without any version checks.
By default, internal versioning is used that starts at 1 and increments
with each update, deletes included. Optionally, the version number can be
supplemented with an external value (for example, if maintained in a
database). To enable this functionality, version_type should be set to
external. The value provided must be a numeric, long value greater than or equal to 0,
and less than around 9.2e+18. When using the external version type, instead
of checking for a matching version number, the system checks to see if
the version number passed to the index request is greater than the
version of the currently stored document. If true, the document will be
indexed and the new version number used. If the value provided is less
than or equal to the stored document’s version number, a version
conflict will occur and the index operation will fail.
A nice side effect is that there is no need to maintain strict ordering of async indexing operations executed as a result of changes to a source database, as long as version numbers from the source database are used. Even the simple case of updating the Elasticsearch index using data from a database is simplified if external versioning is used, as only the latest version will be used if the index operations arrive out of order for whatever reason.
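The comparison rule for external versioning described above can be sketched as (illustrative only):

```python
# Sketch of the documented external-versioning rule: an index request
# succeeds only if its version is strictly greater than the stored
# version, or if no document exists yet.
def external_version_accepts(request_version, stored_version):
    if stored_version is None:  # no existing document
        return True
    return request_version > stored_version

print(external_version_accepts(5, 3))     # True: indexed, new version is 5
print(external_version_accepts(3, 3))     # False: version conflict
print(external_version_accepts(1, None))  # True: first write
```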
Version types
Next to the internal & external version types explained above, Elasticsearch
also supports other types for specific use cases. Here is an overview of
the different version types and their semantics.
internal
Only index the document if the given version is identical to the version of the stored document.
external or external_gt
Only index the document if the given version is strictly higher than the version of the stored document or if there is no existing document. The given version will be used as the new version and will be stored with the new document. The supplied version must be a non-negative long number.
external_gte
Only index the document if the given version is equal to or higher than the version of the stored document. If there is no existing document the operation will succeed as well. The given version will be used as the new version and will be stored with the new document. The supplied version must be a non-negative long number.
force
The document will be indexed regardless of the version of the stored document, or whether a document exists at all. The given version will be used as the new version and will be stored with the new document. This version type is typically used for correcting errors.
NOTE: The external_gte & force version types are meant for special use cases and should be used
with care. If used incorrectly, they can result in loss of data.
Operation Type
The index operation also accepts an op_type that can be used to force
a create operation, allowing for "put-if-absent" behavior. When
create is used, the index operation will fail if a document by that id
already exists in the index.
Here is an example of using the op_type parameter:
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1?op_type=create' -d '{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}'
Another option to specify create is to use the following URI:
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1/_create' -d '{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}'
Automatic ID Generation
The index operation can be executed without specifying the id. In such a
case, an id will be generated automatically. In addition, the op_type
will automatically be set to create. Here is an example (note the
POST used instead of PUT):
$ curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}'
The result of the above index operation is:
{
"_index" : "twitter",
"_type" : "tweet",
"_id" : "6a8ca01c-7896-48e9-81cc-9f70661fcb32",
"_version" : 1,
"created" : true
}
Routing
By default, shard placement — or routing — is controlled by using a
hash of the document’s id value. For more explicit control, the value
fed into the hash function used by the router can be directly specified
on a per-operation basis using the routing parameter. For example:
$ curl -XPOST 'http://localhost:9200/twitter/tweet?routing=kimchy' -d '{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}'
In the example above, the "tweet" document is routed to a shard based on
the routing parameter provided: "kimchy".
When setting up explicit mapping, the _routing field can be optionally
used to direct the index operation to extract the routing value from the
document itself. This does come at the (very minimal) cost of an
additional document parsing pass. If the _routing mapping is defined
and set to be required, the index operation will fail if no routing
value is provided or extracted.
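Conceptually, routing follows the pattern below, where Python's zlib.crc32 stands in for the hash function Elasticsearch actually uses; only the hash-then-modulo shape is the point:

```python
# Illustrative sketch of routing: the routing value (the document id
# by default) is hashed and mapped onto one of the primary shards.
import zlib

def shard_for(routing: str, num_primary_shards: int) -> int:
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

# Every document indexed with routing=kimchy lands on the same shard:
shard = shard_for("kimchy", 5)
print(shard == shard_for("kimchy", 5))  # True
```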
Parents & Children
A child document can be indexed by specifying its parent when indexing. For example:
$ curl -XPUT localhost:9200/blogs/blog_tag/1122?parent=1111 -d '{
"tag" : "something"
}'
When indexing a child document, the routing value is automatically set
to be the same as its parent, unless the routing value is explicitly
specified using the routing parameter.
Timestamp
deprecated[2.0.0-beta2,The _timestamp field is deprecated. Instead, use a normal date field and set its value explicitly]
A document can be indexed with a timestamp associated with it. The
timestamp value of a document can be set using the timestamp
parameter. For example:
$ curl -XPUT localhost:9200/twitter/tweet/1?timestamp=2009-11-15T14%3A12%3A12 -d '{
"user" : "kimchy",
"message" : "trying out Elasticsearch"
}'
If the timestamp value is not provided externally or in the _source,
the timestamp will be automatically set to the date the document was
processed by the indexing chain. More information can be found on the
_timestamp mapping
page.
TTL
deprecated[2.0.0-beta2,The current _ttl implementation is deprecated and will be replaced with a different implementation in a future version]
A document can be indexed with a ttl (time to live) associated with
it. Expired documents will be expunged automatically. The expiration
date that will be set for a document with a provided ttl is relative
to the timestamp of the document, meaning it can be based on the time
of indexing or on any time provided. The provided ttl must be strictly
positive and can be a number (in milliseconds) or any valid time value
as shown in the following examples:
curl -XPUT 'http://localhost:9200/twitter/tweet/1?ttl=86400000' -d '{
"user": "kimchy",
"message": "Trying out elasticsearch, so far so good?"
}'
curl -XPUT 'http://localhost:9200/twitter/tweet/1?ttl=1d' -d '{
"user": "kimchy",
"message": "Trying out elasticsearch, so far so good?"
}'
curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
"_ttl": "1d",
"user": "kimchy",
"message": "Trying out elasticsearch, so far so good?"
}'
More information can be found on the _ttl mapping page.
Distributed
The index operation is directed to the primary shard based on its route (see the Routing section above) and performed on the actual node containing this shard. After the primary shard completes the operation, if needed, the update is distributed to applicable replicas.
Write Consistency
To prevent writes from taking place on the "wrong" side of a network
partition, by default, index operations only succeed if a quorum
(>replicas/2+1) of active shards are available. This default can be
overridden on a node-by-node basis using the action.write_consistency
setting. To alter this behavior per-operation, the consistency request
parameter can be used.
Valid write consistency values are one, quorum, and all.
Note, for the case where the number of replicas is 1 (total of 2 copies of the data), then the default behavior is to succeed if 1 copy (the primary) can perform the write.
The index operation only returns after all active shards within the replication group have indexed the document (sync replication).
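The quorum computation described above uses integer arithmetic and can be sketched as follows (illustrative only):

```python
# Sketch of the documented quorum rule: replicas/2 + 1 copies
# (integer arithmetic) must be active for the write to proceed.
def quorum(replicas: int) -> int:
    return replicas // 2 + 1

print(quorum(1))  # 1 -> with one replica, the primary alone suffices
print(quorum(2))  # 2 -> at least two active copies required
```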
Refresh
To refresh the shard (not the whole index) immediately after the operation
occurs, so that the document appears in search results immediately, the
refresh parameter can be set to true. Setting this option to true should
ONLY be done after careful thought and verification that it does not lead to
poor performance, both from an indexing and a search standpoint. Note, getting
a document using the get API is completely realtime and doesn’t require a
refresh.
Noop Updates
When updating a document using the index API, a new version of the document is
always created even if the document hasn’t changed. If this isn’t acceptable,
use the _update API with detect_noop set to true. This option isn’t
available on the index API because the index API doesn’t fetch the old source
and isn’t able to compare it against the new source.
There isn’t a hard and fast rule about when noop updates aren’t acceptable. It’s a combination of many factors, like how frequently your data source sends updates that are actually noops and how many queries per second Elasticsearch runs on the shard receiving the updates.
Timeout
The primary shard assigned to perform the index operation might not be
available when the index operation is executed. Some reasons for this
might be that the primary shard is currently recovering from a gateway
or undergoing relocation. By default, the index operation will wait on
the primary shard to become available for up to 1 minute before failing
and responding with an error. The timeout parameter can be used to
explicitly specify how long it waits. Here is an example of setting it
to 5 minutes:
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1?timeout=5m' -d '{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}'
25. Get API
The get API allows you to get a typed JSON document from the index based on its id. The following example gets a JSON document from an index called twitter, under a type called tweet, with id valued 1:
curl -XGET 'http://localhost:9200/twitter/tweet/1'
The result of the above get operation is:
{
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_version" : 1,
"found": true,
"_source" : {
"user" : "kimchy",
"postDate" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
}
The above result includes the _index, _type, _id and _version
of the document we wish to retrieve, including the actual _source
of the document if it could be found (as indicated by the found
field in the response).
The API also allows to check for the existence of a document using
HEAD, for example:
curl -XHEAD -i 'http://localhost:9200/twitter/tweet/1'
Realtime
By default, the get API is realtime, and is not affected by the refresh rate of the index (when data will become visible for search).
In order to disable realtime GET, one can either set the realtime
parameter to false, or globally default it to false by setting
action.get.realtime to false in the node configuration.
When getting a document, one can specify fields to fetch from it. They
will, when possible, be fetched as stored fields (fields mapped as
stored in the mapping). When using realtime GET, there is no notion of
stored fields (at least for a period of time, basically, until the next
flush), so they will be extracted from the source itself (note, even if
source is not enabled). It is a good practice to assume that the fields
will be loaded from source when using realtime GET, even if the fields
are stored.
Optional Type
The get API allows for _type to be optional. Set it to _all in order
to fetch the first document matching the id across all types.
Source filtering
By default, the get operation returns the contents of the _source field unless
you have used the fields parameter or if the _source field is disabled.
You can turn off _source retrieval by using the _source parameter:
curl -XGET 'http://localhost:9200/twitter/tweet/1?_source=false'
If you only need one or two fields from the complete _source, you can use the _source_include
& _source_exclude parameters to include or filter out the parts you need. This can be especially helpful
with large documents where partial retrieval can save on network overhead. Both parameters take a comma separated list
of fields or wildcard expressions. Example:
curl -XGET 'http://localhost:9200/twitter/tweet/1?_source_include=*.id&_source_exclude=entities'
If you only want to specify includes, you can use a shorter notation:
curl -XGET 'http://localhost:9200/twitter/tweet/1?_source=*.id,retweeted'
Fields
The get operation allows specifying a set of stored fields that will be
returned by passing the fields parameter. For example:
curl -XGET 'http://localhost:9200/twitter/tweet/1?fields=title,content'
For backward compatibility, if the requested fields are not stored, they will be fetched
from the _source (parsed and extracted). This functionality has been replaced by the
source filtering parameter.
Field values fetched from the document itself are always returned as an array. Metadata fields like the _routing and
_parent fields are never returned as an array.
Also only leaf fields can be returned via the field option. So object fields can’t be returned and such requests
will fail.
Generated fields
If no refresh occurred between indexing and the get operation, GET will access the transaction log to fetch the document. However, some fields are generated only when indexing.
If you try to access a field that is only generated when indexing, you will get an exception (default). You can choose to ignore fields that are generated if the transaction log is accessed by setting ignore_errors_on_generated_fields=true.
Getting the _source directly
Use the /{index}/{type}/{id}/_source endpoint to get
just the _source field of the document,
without any additional content around it. For example:
curl -XGET 'http://localhost:9200/twitter/tweet/1/_source'
You can also use the same source filtering parameters to control which parts of the _source will be returned:
curl -XGET 'http://localhost:9200/twitter/tweet/1/_source?_source_include=*.id&_source_exclude=entities'
Note, there is also a HEAD variant for the _source endpoint to efficiently test for document existence. Curl example:
curl -XHEAD -i 'http://localhost:9200/twitter/tweet/1/_source'
Routing
When indexing using the ability to control the routing, in order to get a document, the routing value should also be provided. For example:
curl -XGET 'http://localhost:9200/twitter/tweet/1?routing=kimchy'
The above will get a tweet with id 1, but will be routed based on the user. Note, issuing a get without the correct routing, will cause the document not to be fetched.
Preference
Controls a preference of which shard replicas to execute the get
request on. By default, the operation is randomized between the shard
replicas.
The preference can be set to:
_primary
The operation will only be executed on primary shards.
_local
The operation will prefer to be executed on a locally allocated shard if possible.
Custom (string) value
A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with "jumping values" when hitting different shards in different refresh states. A sample value can be something like the web session id or the user name.
Refresh
The refresh parameter can be set to true in order to refresh the
relevant shard before the get operation and make it searchable. Setting
it to true should be done after careful thought and verification that
this does not cause a heavy load on the system (and slows down
indexing).
Distributed
The get operation gets hashed into a specific shard id. It then gets redirected to one of the replicas within that shard id and returns the result. The replicas are the primary shard and its replicas within that shard id group. This means that the more replicas we will have, the better GET scaling we will have.
Versioning support
You can use the version parameter to retrieve the document only if
its current version is equal to the specified one. This behavior is the same
for all version types with the exception of version type FORCE, which always
retrieves the document.
Internally, Elasticsearch has marked the old document as deleted and added an entirely new document. The old version of the document doesn’t disappear immediately, although you won’t be able to access it. Elasticsearch cleans up deleted documents in the background as you continue to index more data.
26. Delete API
The delete API allows you to delete a typed JSON document from a specific index based on its id. The following example deletes the JSON document from an index called twitter, under a type called tweet, with id valued 1:
$ curl -XDELETE 'http://localhost:9200/twitter/tweet/1'
The result of the above delete operation is:
{
"_shards" : {
"total" : 10,
"failed" : 0,
"successful" : 10
},
"found" : true,
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_version" : 2
}
Versioning
Each document indexed is versioned. When deleting a document, the
version can be specified to make sure the relevant document we are
trying to delete is actually being deleted and it has not changed in the
meantime. Every write operation executed on a document, deletes included,
causes its version to be incremented.
Routing
When indexing using the ability to control the routing, in order to delete a document, the routing value should also be provided. For example:
$ curl -XDELETE 'http://localhost:9200/twitter/tweet/1?routing=kimchy'
The above will delete a tweet with id 1, but will be routed based on the user. Note, issuing a delete without the correct routing, will cause the document to not be deleted.
Many times, the routing value is not known when deleting a document. For
those cases, when specifying the _routing mapping as required, and
no routing value is specified, the delete will be broadcast
automatically to all shards.
Parent
The parent parameter can be set, which will basically be the same as
setting the routing parameter.
Note that deleting a parent document does not automatically delete its
children. One way of deleting all child documents given a parent’s id is
to use the delete-by-query plugin to perform a delete on the child
index with the automatically generated (and indexed)
field _parent, which is in the format parent_type#parent_id.
Automatic index creation
The delete operation automatically creates an index if it has not been created before (check out the create index API for manually creating an index), and also automatically creates a dynamic type mapping for the specific type if it has not been created before (check out the put mapping API for manually creating type mapping).
Distributed
The delete operation gets hashed into a specific shard id. It then gets redirected into the primary shard within that id group, and replicated (if needed) to shard replicas within that id group.
Write Consistency
Control if the operation will be allowed to execute based on the number
of active shards within that partition (replication group). The values
allowed are one, quorum, and all. The parameter to set it is
consistency, and it defaults to the node level setting of
action.write_consistency which in turn defaults to quorum.
For example, in an index with N shards and 2 replicas, there will have to be
at least 2 active shard copies within the relevant partition (quorum) for
the operation to succeed. With N shards and 1 replica, a single active
shard suffices (in this case, one and quorum are the same).
Refresh
The refresh parameter can be set to true in order to refresh the relevant
primary and replica shards after the delete operation has occurred and make it
searchable. Setting it to true should be done after careful thought and
verification that this does not cause a heavy load on the system (and slows
down indexing).
Timeout
The primary shard assigned to perform the delete operation might not be
available when the delete operation is executed. Some reasons for this
might be that the primary shard is currently recovering from a store
or undergoing relocation. By default, the delete operation will wait on
the primary shard to become available for up to 1 minute before failing
and responding with an error. The timeout parameter can be used to
explicitly specify how long it waits. Here is an example of setting it
to 5 minutes:
$ curl -XDELETE 'http://localhost:9200/twitter/tweet/1?timeout=5m'
27. Update API
The update API allows you to update a document based on a script provided. The operation gets the document (collocated with the shard) from the index, runs the script (with optional script language and parameters), and indexes back the result (it also allows you to delete the document, or ignore the operation). It uses versioning to make sure no updates have happened during the "get" and "reindex".
Note, this operation still means a full reindex of the document, it just
removes some network roundtrips and reduces the chances of version conflicts
between the get and the index. The _source field needs to be enabled
for this feature to work.
For example, let’s index a simple doc:
curl -XPUT localhost:9200/test/type1/1 -d '{
"counter" : 1,
"tags" : ["red"]
}'
Scripted updates
Now, we can execute a script that would increment the counter:
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
"script" : {
"inline": "ctx._source.counter += count",
"params" : {
"count" : 4
}
}
}'
We can add a tag to the list of tags (note, if the tag already exists, it will still be added, since it’s a list):
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
"script" : {
"inline": "ctx._source.tags += tag",
"params" : {
"tag" : "blue"
}
}
}'
In addition to _source, the following variables are available through
the ctx map: _index, _type, _id, _version, _routing,
_parent, _timestamp, _ttl.
We can also add a new field to the document:
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
"script" : "ctx._source.name_of_new_field = \"value_of_new_field\""
}'
Or remove a field from the document:
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
"script" : "ctx._source.remove(\"name_of_field\")"
}'
And, we can even change the operation that is executed. This example deletes
the doc if the tags field contain blue, otherwise it does nothing
(noop):
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
"script" : {
"inline": "ctx._source.tags.contains(tag) ? ctx.op = \"delete\" : ctx.op = \"none\"",
"params" : {
"tag" : "blue"
}
}
}'
Updates with a partial document
The update API also supports passing a partial document, which will be merged into the existing document (simple recursive merge, inner merging of objects, replacing core "keys/values" and arrays). For example:
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
"doc" : {
"name" : "new_name"
}
}'
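The merge behavior described above can be sketched as follows (an illustrative re-implementation, not Elasticsearch code):

```python
# Sketch of the "simple recursive merge": objects are merged
# key-by-key, while scalar values and arrays are replaced wholesale.
def merge(existing, partial):
    if isinstance(existing, dict) and isinstance(partial, dict):
        merged = dict(existing)
        for key, value in partial.items():
            merged[key] = merge(existing[key], value) if key in existing else value
        return merged
    return partial  # scalars and arrays are replaced, not merged

doc = {"name": "old_name", "meta": {"views": 1, "tags": ["red"]}}
print(merge(doc, {"name": "new_name", "meta": {"tags": ["blue"]}}))
# {'name': 'new_name', 'meta': {'views': 1, 'tags': ['blue']}}
```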
If both doc and script are specified, then doc is ignored. It is best
to put the fields of your partial document in the script itself.
Detecting noop updates
If doc is specified its value is merged with the existing _source. By
default the document is only reindexed if the new _source field differs from
the old. Setting detect_noop to false will cause Elasticsearch to always
update the document even if it hasn’t changed. For example:
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
"doc" : {
"name" : "new_name"
},
"detect_noop": false
}'
If name was new_name before the request was sent, the document is still
reindexed.
Upserts
If the document does not already exist, the contents of the upsert element
will be inserted as a new document. If the document does exist, then the
script will be executed instead:
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
"script" : {
"inline": "ctx._source.counter += count",
"params" : {
"count" : 4
}
},
"upsert" : {
"counter" : 1
}
}'
scripted_upsert
If you would like your script to run regardless of whether the document exists
or not — i.e. the script handles initializing the document instead of the
upsert element — then set scripted_upsert to true:
curl -XPOST 'localhost:9200/sessions/session/dh3sgudg8gsrgl/_update' -d '{
"scripted_upsert":true,
"script" : {
"id": "my_web_session_summariser",
"params" : {
"pageViewEvent" : {
"url":"foo.com/bar",
"response":404,
"time":"2014-01-01 12:32"
}
}
},
"upsert" : {}
}'
doc_as_upsert
Instead of sending a partial doc plus an upsert doc, setting
doc_as_upsert to true will use the contents of doc as the upsert
value:
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
"doc" : {
"name" : "new_name"
},
"doc_as_upsert" : true
}'
Parameters
The update operation supports the following query-string parameters:
retry_on_conflict
In between the get and indexing phases of the update, it is possible that another process might have already updated the same document. By default, the update will fail with a version conflict exception. The retry_on_conflict parameter controls how many times to retry the update before finally throwing an exception.
routing
Routing is used to route the update request to the right shard and sets the routing for the upsert request if the document being updated doesn’t exist. Can’t be used to update the routing of an existing document.
parent
Parent is used to route the update request to the right shard and sets the parent for the upsert request if the document being updated doesn’t exist. Can’t be used to update the parent of an existing document.
timeout
Timeout waiting for a shard to become available.
consistency
The write consistency of the index/delete operation.
refresh
Refresh the relevant primary and replica shards (not the whole index) immediately after the operation occurs, so that the updated document appears in search results immediately.
fields
Return the relevant fields from the updated document. Specify _source to return the full updated source.
version & version_type
The update API uses Elasticsearch’s versioning support internally to make sure the document doesn’t change during the update. You can use the version parameter to specify that the document should only be updated if its version matches the one specified.
Note, the update API does not support external versioning. External versioning (version types external and external_gte) is not supported by the update API, as it would result in Elasticsearch version numbers being out of sync with the external system.
28. Update By Query API
experimental[The update-by-query API is new and should still be considered experimental. The API may change in ways that are not backwards compatible]
The simplest usage of _update_by_query just performs an update on every
document in the index without changing the source. This is useful to
pick up a new property or some other online
mapping change. Here is the API:
POST /twitter/_update_by_query?conflicts=proceed
That will return something like this:
{
"took" : 639,
"updated": 1235,
"batches": 13,
"version_conflicts": 2,
"failures" : [ ]
}
_update_by_query gets a snapshot of the index when it starts and indexes what
it finds using internal versioning. That means that you’ll get a version
conflict if the document changes between the time when the snapshot was taken
and when the index request is processed. When the versions match the document
is updated and the version number is incremented.
All update and query failures cause the _update_by_query to abort and are
returned in the failures element of the response. The updates that have been
performed still stick. In other words, the process is not rolled back, only
aborted. While the first failure causes the abort, all failures that are
returned by the failing bulk request are returned in the failures element, so
it’s possible for there to be quite a few.
If you want to simply count version conflicts rather than have them cause the
_update_by_query to abort, you can set conflicts=proceed on the url or
"conflicts": "proceed" in the request body. The first example does this
because it is just trying to pick up an online mapping change, and a version
conflict simply means that the conflicting document was updated between the
start of the _update_by_query and the time when it attempted to update the
document. This is fine because that update will have picked up the online
mapping update.
Back to the API format, you can limit _update_by_query to a single type. This
will only update documents of type tweet in the twitter index:
POST /twitter/tweet/_update_by_query?conflicts=proceed
You can also limit _update_by_query using the
Query DSL. This will update all documents from the
twitter index for the user kimchy:
POST /twitter/_update_by_query?conflicts=proceed
{
"query": {
"term": {
"user": "kimchy"
}
}
}
The query must be passed as a value to the query key, in the same
way as the Search API. You can also use the q
parameter in the same way as the search API.
So far we’ve only been updating documents without changing their source. That
is genuinely useful for things like
picking up new properties but it’s only half the
fun. _update_by_query supports a script object to update the document. This
will increment the likes field on all of kimchy’s tweets:
POST /twitter/_update_by_query
{
"script": {
"inline": "ctx._source.likes++"
},
"query": {
"term": {
"user": "kimchy"
}
}
}
Just as in the Update API you can set ctx.op = "noop" if
your script decides that it doesn’t have to make any changes. That will cause
_update_by_query to omit that document from its updates. Setting ctx.op to
anything else is an error. If you want to delete by a query you can use the
Delete by Query plugin instead. Setting any
other field in ctx is an error.
Note that we stopped specifying conflicts=proceed. In this case we want a
version conflict to abort the process so we can handle the failure.
This API doesn’t allow you to move the documents it touches, just modify their source. This is intentional! We’ve made no provisions for removing the document from its original location.
It’s also possible to do this whole thing on multiple indexes and multiple types at once, just like the search API:
POST /twitter,blog/tweet,post/_update_by_query
If you provide routing then the routing is copied to the scroll query,
limiting the process to the shards that match that routing value:
POST /twitter/_update_by_query?routing=1
By default _update_by_query uses scroll batches of 100. You can change the
batch size with the scroll_size URL parameter:
POST /twitter/_update_by_query?scroll_size=1000
URL Parameters
In addition to the standard parameters like pretty, the Update By Query API
also supports refresh, wait_for_completion, consistency, and timeout.
Sending the refresh will update all shards in the index being updated when
the request completes. This is different than the Index API’s refresh
parameter which causes just the shard that received the new data to be indexed.
If the request contains wait_for_completion=false then Elasticsearch will
perform some preflight checks, launch the request, and then return a task
which can be used with Tasks APIs to cancel
or get the status of the task. For now, once the request is finished the task
is gone and the only place to look for the ultimate result of the task is in
the Elasticsearch log file. This will be fixed soon.
consistency controls how many copies of a shard must respond to each write
request. timeout controls how long each write request waits for unavailable
shards to become available. Both work exactly how they work in the
Bulk API.
Response body
The JSON response looks like this:
{
"took" : 639,
"updated": 0,
"batches": 1,
"version_conflicts": 2,
"failures" : [ ]
}
took-
The number of milliseconds from start to end of the whole operation.
updated-
The number of documents that were successfully updated.
batches-
The number of scroll responses pulled back by the update by query.
version_conflicts-
The number of version conflicts that the update by query hit.
failures-
Array of all indexing failures. If this is non-empty then the request aborted because of those failures. See
conflictsfor how to prevent version conflicts from aborting the operation.
Works with the Task API
While Update By Query is running you can fetch its status using the Task API:
GET /_tasks/?pretty&detailed=true&actions=*byquery
The response looks like:
{
"nodes" : {
"r1A2WoRbTwKZ516z6NEs5A" : {
"name" : "Tyrannus",
"transport_address" : "127.0.0.1:9300",
"host" : "127.0.0.1",
"ip" : "127.0.0.1:9300",
"attributes" : {
"testattr" : "test",
"portsfile" : "true"
},
"tasks" : {
"r1A2WoRbTwKZ516z6NEs5A:36619" : {
"node" : "r1A2WoRbTwKZ516z6NEs5A",
"id" : 36619,
"type" : "transport",
"action" : "indices:data/write/update/byquery",
"status" : {
"total" : 6154,
"updated" : 3500,
"created" : 0,
"deleted" : 0,
"batches" : 36,
"version_conflicts" : 0,
"noops" : 0
},
"description" : ""
}
}
}
}
}
The status object contains the actual status. It is just like the response
JSON with the important addition of the total field. total is the total
number of operations that the update by query expects to perform. You can
estimate the progress by adding the updated, created, and deleted fields.
The request will finish when their sum is equal to the total field.
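The progress estimate described above can be written as a small helper (an illustrative Python sketch, not part of any Elasticsearch client; the status dict mirrors the task's status object):

```python
def estimate_progress(status):
    """Estimate a task's progress from its Task API "status" object:
    the sum of updated, created and deleted over the expected total."""
    done = status["updated"] + status["created"] + status["deleted"]
    return done / status["total"]

status = {"total": 6154, "updated": 3500, "created": 0, "deleted": 0}
print(round(estimate_progress(status), 2))  # 0.57
```

When the returned fraction reaches 1.0 the request is finished.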
Pick up a new property
Say you created an index without dynamic mapping, filled it with data, and then added a mapping value to pick up more fields from the data:
PUT test
{
"mappings": {
"test": {
"dynamic": false,
"properties": {
"text": {"type": "string"}
}
}
}
}
POST test/test?refresh
{
"text": "words words",
"flag": "bar"
}
POST test/test?refresh
{
"text": "words words",
"flag": "foo"
}
PUT test/_mapping/test
{
"properties": {
"text": {"type": "string"},
"flag": {"type": "string", "analyzer": "keyword"}
}
}
This means that new fields won’t be indexed, just stored in _source.
This updates the mapping to add the new flag field. To pick up the new
field you have to reindex all documents with it.
Searching for the data won’t find anything:
POST test/_search?filter_path=hits.total
{
"query": {
"match": {
"flag": "foo"
}
}
}
{
"hits" : {
"total" : 0
}
}
But you can issue an _update_by_query request to pick up the new mapping:
POST test/_update_by_query?refresh&conflicts=proceed
POST test/_search?filter_path=hits.total
{
"query": {
"match": {
"flag": "foo"
}
}
}
{
"hits" : {
"total" : 1
}
}
You can do the exact same thing when adding a field to a multifield.
29. Multi Get API
Multi GET API allows to get multiple documents based on an index, type
(optional) and id (and possibly routing). The response includes a docs
array with all the fetched documents, each element similar in structure
to a document provided by the get
API. Here is an example:
curl 'localhost:9200/_mget' -d '{
"docs" : [
{
"_index" : "test",
"_type" : "type",
"_id" : "1"
},
{
"_index" : "test",
"_type" : "type",
"_id" : "2"
}
]
}'
The mget endpoint can also be used against an index (in which case it
is not required in the body):
curl 'localhost:9200/test/_mget' -d '{
"docs" : [
{
"_type" : "type",
"_id" : "1"
},
{
"_type" : "type",
"_id" : "2"
}
]
}'
And type:
curl 'localhost:9200/test/type/_mget' -d '{
"docs" : [
{
"_id" : "1"
},
{
"_id" : "2"
}
]
}'
In which case, the ids element can directly be used to simplify the
request:
curl 'localhost:9200/test/type/_mget' -d '{
"ids" : ["1", "2"]
}'
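The ids shorthand is simply sugar for a docs array in which each entry carries only its _id; index and type come from the URL. A small illustrative sketch of the equivalence (the helper name is hypothetical):

```python
def mget_body_from_ids(ids):
    """Expand the "ids" shorthand into the equivalent "docs" array.
    Index and type come from the URL in this form, so each doc entry
    only needs its _id."""
    return {"docs": [{"_id": doc_id} for doc_id in ids]}

body = mget_body_from_ids(["1", "2"])
# body == {"docs": [{"_id": "1"}, {"_id": "2"}]}
```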
Optional Type
The mget API allows for _type to be optional. Set it to _all or leave it empty in order
to fetch the first document matching the id across all types.
If you don’t set the type and have many documents sharing the same _id, you will end up
getting only the first matching document.
For example, if you have a document 1 within typeA and typeB then the following request will give you back the same document twice:
curl 'localhost:9200/test/_mget' -d '{
"ids" : ["1", "1"]
}'
You need in that case to explicitly set the _type:
GET /test/_mget/
{
"docs" : [
{
"_type":"typeA",
"_id" : "1"
},
{
"_type":"typeB",
"_id" : "1"
}
]
}
Source filtering
By default, the _source field will be returned for every document (if stored).
Similar to the get API, you can retrieve only parts of
the _source (or not at all) by using the _source parameter. You can also use
the URL parameters _source, _source_include & _source_exclude to specify defaults,
which will be used when there are no per-document instructions.
For example:
curl 'localhost:9200/_mget' -d '{
"docs" : [
{
"_index" : "test",
"_type" : "type",
"_id" : "1",
"_source" : false
},
{
"_index" : "test",
"_type" : "type",
"_id" : "2",
"_source" : ["field3", "field4"]
},
{
"_index" : "test",
"_type" : "type",
"_id" : "3",
"_source" : {
"include": ["user"],
"exclude": ["user.location"]
}
}
]
}'
Fields
Specific stored fields can be specified to be retrieved per document to get, similar to the fields parameter of the Get API. For example:
curl 'localhost:9200/_mget' -d '{
"docs" : [
{
"_index" : "test",
"_type" : "type",
"_id" : "1",
"fields" : ["field1", "field2"]
},
{
"_index" : "test",
"_type" : "type",
"_id" : "2",
"fields" : ["field3", "field4"]
}
]
}'
Alternatively, you can specify the fields parameter in the query string
as a default to be applied to all documents.
curl 'localhost:9200/test/type/_mget?fields=field1,field2' -d '{
"docs" : [
{
"_id" : "1"
},
{
"_id" : "2",
"fields" : ["field3", "field4"]
}
]
}'
Document 1 returns field1 and field2 (the default from the query string),
while document 2 returns field3 and field4 (its per-document fields
parameter overrides the default).
Generated fields
See Generated fields for fields that are generated only when indexing.
Routing
You can also specify a routing value as a parameter:
curl 'localhost:9200/_mget?routing=key1' -d '{
"docs" : [
{
"_index" : "test",
"_type" : "type",
"_id" : "1",
"_routing" : "key2"
},
{
"_index" : "test",
"_type" : "type",
"_id" : "2"
}
]
}'
In this example, document test/type/2 will be fetched from the shard corresponding to routing key key1, but
document test/type/1 will be fetched from the shard corresponding to routing key key2.
Security
30. Bulk API
The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed.
The REST API endpoint is /_bulk, and it expects the following JSON
structure:
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n
NOTE: the final line of data must end with a newline character \n.
The possible actions are index, create, delete and update.
index and create expect a source on the next
line, and have the same semantics as the op_type parameter to the
standard index API (i.e. create will fail if a document with the same
index and type exists already, whereas index will add or replace a
document as necessary). delete does not expect a source on the
following line, and has the same semantics as the standard delete API.
update expects that the partial doc, upsert and script and its options
are specified on the next line.
If you’re providing text file input to curl, you must use the
--data-binary flag instead of plain -d. The latter doesn’t preserve
newlines. Example:
$ cat requests
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
$ curl -s -XPOST localhost:9200/_bulk --data-binary "@requests"; echo
{"took":7,"items":[{"create":{"_index":"test","_type":"type1","_id":"1","_version":1}}]}
Because this format uses literal \n's as delimiters, please be sure
that the JSON actions and sources are not pretty printed. Here is an
example of a correct sequence of bulk commands:
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1"} }
{ "doc" : {"field2" : "value2"} }
In the above example doc for the update action is a partial
document, that will be merged with the already stored document.
The endpoints are /_bulk, /{index}/_bulk, and {index}/{type}/_bulk.
When the index or the index/type are provided, they will be used by
default on bulk items that don’t provide them explicitly.
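Building a bulk body by hand boils down to serializing each action (and its optional source) as a single JSON line and terminating the whole body with a newline. A minimal illustrative sketch (the bulk_body helper is hypothetical, not a client API):

```python
import json

def bulk_body(actions):
    """Serialize (action_and_meta_data, optional_source) pairs into the
    newline-delimited bulk format. Pass None as the source for actions
    like delete that carry no source line. The final line must end with
    a newline character."""
    lines = []
    for action, source in actions:
        lines.append(json.dumps(action))  # one JSON object per line
        if source is not None:
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"

body = bulk_body([
    ({"index": {"_index": "test", "_type": "type1", "_id": "1"}},
     {"field1": "value1"}),
    ({"delete": {"_index": "test", "_type": "type1", "_id": "2"}}, None),
])
```

Because json.dumps never emits embedded newlines, the actions and sources are guaranteed not to be pretty printed.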
A note on the format. The idea here is to make processing of this as
fast as possible. As some of the actions will be redirected to other
shards on other nodes, only the action_and_meta_data line is parsed on the
receiving node side.
Client libraries using this protocol should try and strive to do something similar on the client side, and reduce buffering as much as possible.
The response to a bulk action is a large JSON structure with the individual results of each action that was performed. The failure of a single action does not affect the remaining actions.
There is no "correct" number of actions to perform in a single bulk call. You should experiment with different settings to find the optimum size for your particular workload.
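One common way to run that experiment is to make the batch size a parameter, e.g. with a generic batching helper (illustrative sketch, not part of any client library):

```python
def batches(actions, size):
    """Yield successive batches of at most `size` actions, so the size
    of each bulk request can be tuned per workload."""
    batch = []
    for action in actions:
        batch.append(action)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

chunks = list(batches(range(7), 3))  # batch sizes: 3, 3, 1
```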
If using the HTTP API, make sure that the client does not send HTTP chunks, as this will slow things down.
Versioning
Each bulk item can include the version value using the
_version/version field. It automatically follows the behavior of the
index / delete operation based on the _version mapping. It also
supports the version_type/_version_type (see versioning).
Routing
Each bulk item can include the routing value using the
_routing/routing field. It automatically follows the behavior of the
index / delete operation based on the _routing mapping.
Parent
Each bulk item can include the parent value using the _parent/parent
field. It automatically follows the behavior of the index / delete
operation based on the _parent / _routing mapping.
Timestamp
deprecated[2.0.0-beta2,The _timestamp field is deprecated. Instead, use a normal date field and set its value explicitly]
Each bulk item can include the timestamp value using the
_timestamp/timestamp field. It automatically follows the behavior of
the index operation based on the _timestamp mapping.
TTL
deprecated[2.0.0-beta2,The current _ttl implementation is deprecated and will be replaced with a different implementation in a future version]
Each bulk item can include the ttl value using the _ttl/ttl field.
It automatically follows the behavior of the index operation based on
the _ttl mapping.
Write Consistency
When making bulk calls, you can require a minimum number of active
shards in the partition through the consistency parameter. The values
allowed are one, quorum, and all. It defaults to the node level
setting of action.write_consistency, which in turn defaults to
quorum.
For example, in an index with N shards and 2 replicas, there will have to be
at least 2 active shards within the relevant partition (quorum) for
the operation to succeed. With N shards and 1 replica, there
will need to be a single shard active (in this case, one and quorum
are the same).
Refresh
The refresh parameter can be set to true in order to refresh the relevant
primary and replica shards immediately after the bulk operation has occurred
and make it searchable, instead of waiting for the normal refresh interval to
expire. Setting it to true can trigger additional load, and may slow down
indexing. Due to its costly nature, the refresh parameter is set on the bulk request level
and is not supported on each individual bulk item.
Update
When using the update action, _retry_on_conflict can be used as a field in
the action itself (not in the extra payload line), to specify how many
times an update should be retried in the case of a version conflict.
The update action payload supports the following options: doc
(partial document), upsert, doc_as_upsert, script, params (for
script), lang (for script) and fields. See update documentation for details on
the options. Curl example with update actions:
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1", "_retry_on_conflict" : 3} }
{ "doc" : {"field" : "value"} }
{ "update" : { "_id" : "0", "_type" : "type1", "_index" : "index1", "_retry_on_conflict" : 3} }
{ "script" : { "inline": "ctx._source.counter += param1", "lang" : "js", "params" : {"param1" : 1}}, "upsert" : {"counter" : 1}}
{ "update" : {"_id" : "2", "_type" : "type1", "_index" : "index1", "_retry_on_conflict" : 3} }
{ "doc" : {"field" : "value"}, "doc_as_upsert" : true }
{ "update" : {"_id" : "3", "_type" : "type1", "_index" : "index1", "fields" : ["_source"]} }
{ "doc" : {"field" : "value"} }
{ "update" : {"_id" : "4", "_type" : "type1", "_index" : "index1"} }
{ "doc" : {"field" : "value"}, "fields": ["_source"]}
Security
31. Reindex API
experimental[The reindex API is new and should still be considered experimental. The API may change in ways that are not backwards compatible]
The most basic form of _reindex just copies documents from one index to another.
This will copy documents from the twitter index into the new_twitter index:
POST /_reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter"
}
}
That will return something like this:
{
"took" : 639,
"updated": 112,
"batches": 130,
"version_conflicts": 0,
"failures" : [ ],
"created": 12344
}
Just like _update_by_query, _reindex gets a
snapshot of the source index but its target must be a different index so
version conflicts are unlikely. The dest element can be configured like the
index API to control optimistic concurrency control. Just leaving out
version_type (as above) or setting it to internal will cause Elasticsearch
to blindly dump documents into the target, overwriting any that happen to have
the same type and id:
POST /_reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter",
"version_type": "internal"
}
}
Setting version_type to external will cause Elasticsearch to preserve the
version from the source, create any documents that are missing, and update
any documents that have an older version in the destination index than they do
in the source index:
POST /_reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter",
"version_type": "external"
}
}
Setting op_type to create will cause _reindex to only create missing
documents in the target index. All existing documents will cause a version
conflict:
POST /_reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter",
"op_type": "create"
}
}
By default version conflicts abort the _reindex process but you can just
count them by setting "conflicts": "proceed" in the request body:
POST /_reindex
{
"conflicts": "proceed",
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter",
"op_type": "create"
}
}
You can limit the documents by adding a type to the source or by adding a
query. This will only copy tweets made by kimchy into new_twitter:
POST /_reindex
{
"source": {
"index": "twitter",
"type": "tweet",
"query": {
"term": {
"user": "kimchy"
}
}
},
"dest": {
"index": "new_twitter"
}
}
index and type in source can both be lists, allowing you to copy from
lots of sources in one request. This will copy documents from the tweet and
post types in the twitter and blog indices. It’d include the post type in
the twitter index and the tweet type in the blog index. If you want to be
more specific you’ll need to use the query. It also makes no effort to handle
ID collisions. The target index will remain valid but it’s not easy to predict
which document will survive because the iteration order isn’t well defined.
POST /_reindex
{
"source": {
"index": ["twitter", "blog"],
"type": ["tweet", "post"]
},
"dest": {
"index": "all_together"
}
}
It’s also possible to limit the number of processed documents by setting
size. This will only copy a single document from twitter to
new_twitter:
POST /_reindex
{
"size": 1,
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter"
}
}
If you want a particular set of documents from the twitter index you’ll
need to sort. Sorting makes the scroll less efficient but in some contexts
it’s worth it. If possible, prefer a more selective query to size and sort.
This will copy 10000 documents from twitter into new_twitter:
POST /_reindex
{
"size": 10000,
"source": {
"index": "twitter",
"sort": { "date": "desc" }
},
"dest": {
"index": "new_twitter"
}
}
Like _update_by_query, _reindex supports a script that modifies the
document. Unlike _update_by_query, the script is allowed to modify the
document’s metadata. This example bumps the version of the source document:
POST /_reindex
{
"source": {
"index": "twitter",
},
"dest": {
"index": "new_twitter",
"version_type": "external"
}
"script": {
"internal": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}"
}
}
Think of the possibilities! Just be careful! With great power…. You can change:
-
_id
-
_type
-
_index
-
_version
-
_routing
-
_parent
-
_timestamp
-
_ttl
Setting _version to null or clearing it from the ctx map is just like not
sending the version in an indexing request. It will cause that document to be
overwritten in the target index regardless of the version on the target or the
version type you use in the _reindex request.
By default if _reindex sees a document with routing then the routing is
preserved unless it’s changed by the script. You can set routing on the
dest request to change this:
keep-
Sets the routing on the bulk request sent for each match to the routing on the match. The default.
discard-
Sets the routing on the bulk request sent for each match to null.
=<some text>-
Sets the routing on the bulk request sent for each match to all text after the
=.
For example, you can use the following request to copy all documents from
the source index with the company name cat into the dest index with
routing set to cat.
POST /_reindex
{
"source": {
"index": "source"
"query": {
"match": {
"company": "cat"
}
}
}
"dest": {
"index": "dest",
"routing": "=cat"
}
}
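The three routing directives can be summarized as a small dispatch (illustrative Python only; resolve_routing is a hypothetical name, not server code):

```python
def resolve_routing(directive, match_routing):
    """Apply a dest.routing directive to a matched document's routing:
    "keep" preserves the match's routing, "discard" drops it, and
    "=<text>" forces the literal text after the equals sign."""
    if directive == "keep":
        return match_routing
    if directive == "discard":
        return None
    if directive.startswith("="):
        return directive[1:]
    raise ValueError("unknown routing directive: %s" % directive)

# The example above forces routing "cat" regardless of the source routing:
assert resolve_routing("=cat", "original") == "cat"
```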
URL Parameters
In addition to the standard parameters like pretty, the Reindex API also
supports refresh, wait_for_completion, consistency, and timeout.
Sending the refresh url parameter will cause all indexes to which the request
wrote to be refreshed. This is different than the Index API’s refresh
parameter which causes just the shard that received the new data to be indexed.
If the request contains wait_for_completion=false then Elasticsearch will
perform some preflight checks, launch the request, and then return a task
which can be used with Tasks APIs to cancel or get
the status of the task. For now, once the request is finished the task is gone
and the only place to look for the ultimate result of the task is in the
Elasticsearch log file. This will be fixed soon.
consistency controls how many copies of a shard must respond to each write
request. timeout controls how long each write request waits for unavailable
shards to become available. Both work exactly how they work in the
Bulk API.
Response body
The JSON response looks like this:
{
"took" : 639,
"updated": 0,
"created": 123,
"batches": 1,
"version_conflicts": 2,
"failures" : [ ]
}
took-
The number of milliseconds from start to end of the whole operation.
updated-
The number of documents that were successfully updated.
created-
The number of documents that were successfully created.
batches-
The number of scroll responses pulled back by the reindex.
version_conflicts-
The number of version conflicts that reindex hit.
failures-
Array of all indexing failures. If this is non-empty then the request aborted because of those failures. See
conflictsfor how to prevent version conflicts from aborting the operation.
Works with the Task API
While Reindex is running you can fetch its status using the Task API:
GET /_tasks/?pretty&detailed=true&actions=*reindex
The response looks like:
{
"nodes" : {
"r1A2WoRbTwKZ516z6NEs5A" : {
"name" : "Tyrannus",
"transport_address" : "127.0.0.1:9300",
"host" : "127.0.0.1",
"ip" : "127.0.0.1:9300",
"attributes" : {
"testattr" : "test",
"portsfile" : "true"
},
"tasks" : {
"r1A2WoRbTwKZ516z6NEs5A:36619" : {
"node" : "r1A2WoRbTwKZ516z6NEs5A",
"id" : 36619,
"type" : "transport",
"action" : "indices:data/write/reindex",
"status" : {
"total" : 6154,
"updated" : 3500,
"created" : 0,
"deleted" : 0,
"batches" : 36,
"version_conflicts" : 0,
"noops" : 0
},
"description" : ""
}
}
}
}
}
The status object contains the actual status. It is just like the response
JSON with the important addition of the total field. total is the total
number of operations that the reindex expects to perform. You can estimate
the progress by adding the updated, created, and deleted fields. The
request will finish when their sum is equal to the total field.
Reindex to change the name of a field
_reindex can be used to build a copy of an index with renamed fields. Say you
create an index containing documents that look like this:
POST test/test/1?refresh&pretty
{
"text": "words words",
"flag": "foo"
}
But you don’t like the name flag and want to replace it with tag.
_reindex can create the other index for you:
POST _reindex?pretty
{
"source": {
"index": "test"
},
"dest": {
"index": "test2"
},
"script": {
"inline": "ctx._source.tag = ctx._source.remove(\"flag\")"
}
}
Now you can get the new document:
GET test2/test/1?pretty
and it’ll look like:
{
"text": "words words",
"tag": "foo"
}
Or you can search by tag or whatever you want.
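The effect of the script on each document is the same as a dictionary move, here shown in Python terms (an analogy only; the actual script runs server-side in the script language configured for the cluster):

```python
doc = {"text": "words words", "flag": "foo"}
# Equivalent of: ctx._source.tag = ctx._source.remove("flag")
doc["tag"] = doc.pop("flag")
print(doc)  # {'text': 'words words', 'tag': 'foo'}
```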
32. Term Vectors
Returns information and statistics on terms in the fields of a particular
document. The document could be stored in the index or artificially provided
by the user. Term vectors are realtime by default, not near
realtime. This can be changed by setting the realtime parameter to false.
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true'
Optionally, you can specify the fields for which the information is retrieved either with a parameter in the url
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?fields=text,...'
or by adding the requested fields in the request body (see example below). Fields can also be specified with wildcards, in a similar way to the multi match query.
Note that the usage of /_termvector is deprecated in 2.0, and replaced by /_termvectors.
Return values
Three types of values can be requested: term information, term statistics and field statistics. By default, all term information and field statistics are returned for all fields but no term statistics.
Term information
-
term frequency in the field (always returned)
-
term positions (positions: true)
-
start and end offsets (offsets: true)
-
term payloads (payloads: true), as base64 encoded bytes
If the requested information wasn’t stored in the index, it will be computed on the fly if possible. Additionally, term vectors could be computed for documents not even existing in the index, but instead provided by the user.
Start and end offsets assume UTF-16 encoding is being used. If you want to use these offsets in order to get the original text that produced this token, you should make sure that the string you are taking a sub-string of is also encoded using UTF-16.
Term statistics
Setting term_statistics to true (default is false) will
return
-
total term frequency (how often a term occurs in all documents)
-
document frequency (the number of documents containing the current term)
By default these values are not returned since term statistics can have a serious performance impact.
Field statistics
Setting field_statistics to false (default is true) will
omit:
-
document count (how many documents contain this field)
-
sum of document frequencies (the sum of document frequencies for all terms in this field)
-
sum of total term frequencies (the sum of total term frequencies of each term in this field)
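These field-level numbers are plain sums over the per-term statistics, which a short sketch makes concrete (illustrative only; the term map mirrors the shape of a term_vectors response):

```python
def field_stats(terms):
    """Derive field statistics from per-term statistics: sum_doc_freq
    sums each term's doc_freq, and sum_ttf sums each term's total term
    frequency (ttf)."""
    return {
        "sum_doc_freq": sum(t["doc_freq"] for t in terms.values()),
        "sum_ttf": sum(t["ttf"] for t in terms.values()),
    }

stats = field_stats({
    "test": {"doc_freq": 2, "ttf": 4},
    "twitter": {"doc_freq": 2, "ttf": 2},
})
print(stats)  # {'sum_doc_freq': 4, 'sum_ttf': 6}
```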
Distributed frequencies
Setting dfs to true (default is false) will return the term statistics
or the field statistics of the entire index, and not just those of the shard
the document resides in. Use it with caution as distributed frequencies can
have a serious performance impact.
Terms Filtering
With the parameter filter, the terms returned could also be filtered based
on their tf-idf scores. This could be useful in order to find out a good
characteristic vector of a document. This feature works in a similar manner to
the second phase of the
More Like This Query. See example 5
for usage.
The following sub-parameters are supported:
max_num_terms
|
Maximum number of terms that must be returned per field. Defaults to |
min_term_freq
|
Ignore words with less than this frequency in the source doc. Defaults to |
max_term_freq
|
Ignore words with more than this frequency in the source doc. Defaults to unbounded. |
min_doc_freq
|
Ignore terms which do not occur in at least this many docs. Defaults to |
max_doc_freq
|
Ignore words which occur in more than this many docs. Defaults to unbounded. |
min_word_length
|
The minimum word length below which words will be ignored. Defaults to |
max_word_length
|
The maximum word length above which words will be ignored. Defaults to unbounded ( |
Behaviour
The term and field statistics are not accurate. Deleted documents
are not taken into account. The information is only retrieved for the
shard the requested document resides in, unless dfs is set to true.
The term and field statistics are therefore only useful as relative measures
whereas the absolute numbers have no meaning in this context. By default,
when requesting term vectors of artificial documents, a shard to get the statistics
from is randomly selected. Use routing only to hit a particular shard.
First, we create an index that stores term vectors, payloads etc. :
curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
"mappings": {
"tweet": {
"properties": {
"text": {
"type": "string",
"term_vector": "with_positions_offsets_payloads",
"store" : true,
"analyzer" : "fulltext_analyzer"
},
"fullname": {
"type": "string",
"term_vector": "with_positions_offsets_payloads",
"analyzer" : "fulltext_analyzer"
}
}
}
},
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
},
"analysis": {
"analyzer": {
"fulltext_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"type_as_payload"
]
}
}
}
}
}'
Second, we add some documents:
curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
"fullname" : "John Doe",
"text" : "twitter test test test "
}'
curl -XPUT 'http://localhost:9200/twitter/tweet/2?pretty=true' -d '{
"fullname" : "Jane Doe",
"text" : "Another twitter test ..."
}'
The following request returns all information and statistics for field
text in document 1 (John Doe):
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true' -d '{
"fields" : ["text"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}'
Response:
{
"_id": "1",
"_index": "twitter",
"_type": "tweet",
"_version": 1,
"found": true,
"term_vectors": {
"text": {
"field_statistics": {
"doc_count": 2,
"sum_doc_freq": 6,
"sum_ttf": 8
},
"terms": {
"test": {
"doc_freq": 2,
"term_freq": 3,
"tokens": [
{
"end_offset": 12,
"payload": "d29yZA==",
"position": 1,
"start_offset": 8
},
{
"end_offset": 17,
"payload": "d29yZA==",
"position": 2,
"start_offset": 13
},
{
"end_offset": 22,
"payload": "d29yZA==",
"position": 3,
"start_offset": 18
}
],
"ttf": 4
},
"twitter": {
"doc_freq": 2,
"term_freq": 1,
"tokens": [
{
"end_offset": 7,
"payload": "d29yZA==",
"position": 0,
"start_offset": 0
}
],
"ttf": 2
}
}
}
}
}
Term vectors which are not explicitly stored in the index are automatically
computed on the fly. The following request returns all information and statistics for the
fields in document 1, even though the terms haven’t been explicitly stored in the index.
Note that for the field text, the terms are not re-generated.
curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true' -d '{
"fields" : ["text", "some_field_without_term_vectors"],
"offsets" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}'
Term vectors can also be generated for artificial documents,
that is for documents not present in the index. The syntax is similar to the
percolator API. For example, the following request would
return the same results as in example 1. The mapping used is determined by the
index and type.
If dynamic mapping is turned on (default), the document fields not in the original mapping will be dynamically created.
curl -XGET 'http://localhost:9200/twitter/tweet/_termvectors' -d '{
"doc" : {
"fullname" : "John Doe",
"text" : "twitter test test test"
}
}'
Additionally, an analyzer different from the one configured for the field may be
provided by using the per_field_analyzer parameter. This is useful in order to
generate term vectors in any fashion, especially when using artificial
documents. When providing an analyzer for a field that already stores term
vectors, the term vectors will be re-generated.
curl -XGET 'http://localhost:9200/twitter/tweet/_termvectors' -d '{
"doc" : {
"fullname" : "John Doe",
"text" : "twitter test test test"
},
"fields": ["fullname"],
"per_field_analyzer" : {
"fullname": "keyword"
}
}'
Response:
{
"_index": "twitter",
"_type": "tweet",
"_version": 0,
"found": true,
"term_vectors": {
"fullname": {
"field_statistics": {
"sum_doc_freq": 1,
"doc_count": 1,
"sum_ttf": 1
},
"terms": {
"John Doe": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 8
}
]
}
}
}
}
}
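The single term in the response above illustrates what the keyword analyzer does: it emits the whole input as one token. A rough, illustrative Python sketch (not Elasticsearch's actual analysis chain):

```python
# Illustrative only: the keyword analyzer emits the entire input as a
# single token, while a whitespace-style analysis splits it into two.
text = "John Doe"

keyword_tokens = [(text, 0, len(text))]  # one token, offsets 0-8

whitespace_tokens = []
offset = 0
for word in text.split(" "):
    start = text.index(word, offset)
    end = start + len(word)
    whitespace_tokens.append((word, start, end))
    offset = end

print(keyword_tokens)     # [('John Doe', 0, 8)]
print(whitespace_tokens)  # [('John', 0, 4), ('Doe', 5, 8)]
```

This matches the response: one term "John Doe" with start_offset 0 and end_offset 8.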
Finally, the terms returned can be filtered based on their tf-idf scores. In the example below we obtain the three most "interesting" keywords from the artificial document having the given "plot" field value. Additionally, we ask for distributed frequencies to obtain more accurate results. Notice that the keyword "Tony" and the stop words are not part of the response, as their tf-idf scores are too low.
GET /imdb/movies/_termvectors
{
"doc": {
"plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
},
"term_statistics" : true,
"field_statistics" : true,
"dfs": true,
"positions": false,
"offsets": false,
"filter" : {
"max_num_terms" : 3,
"min_term_freq" : 1,
"min_doc_freq" : 1
}
}
Response:
{
"_index": "imdb",
"_type": "movies",
"_version": 0,
"found": true,
"term_vectors": {
"plot": {
"field_statistics": {
"sum_doc_freq": 3384269,
"doc_count": 176214,
"sum_ttf": 3753460
},
"terms": {
"armored": {
"doc_freq": 27,
"ttf": 27,
"term_freq": 1,
"score": 9.74725
},
"industrialist": {
"doc_freq": 88,
"ttf": 88,
"term_freq": 1,
"score": 8.590818
},
"stark": {
"doc_freq": 44,
"ttf": 47,
"term_freq": 1,
"score": 9.272792
}
}
}
}
}
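The scores in the response above are consistent with Lucene's classic tf-idf, i.e. sqrt(term_freq) * (1 + ln(doc_count / (doc_freq + 1))). That formula is our observation, not something the response documents; the following sketch reproduces the three scores from the statistics shown:

```python
import math

def tfidf(term_freq, doc_freq, doc_count):
    # tf = sqrt(term_freq); idf = 1 + ln(doc_count / (doc_freq + 1))
    return math.sqrt(term_freq) * (1.0 + math.log(doc_count / (doc_freq + 1)))

doc_count = 176214  # field_statistics.doc_count from the response
for term, doc_freq in [("armored", 27), ("industrialist", 88), ("stark", 44)]:
    print(term, tfidf(1, doc_freq, doc_count))
# close to the response scores: 9.74725, 8.590818, 9.272792
```

Rare terms (low doc_freq) get the highest scores, which is why "armored" outranks everything else.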
33. Multi termvectors API
The multi termvectors API allows you to get multiple term vectors at once. The
documents from which to retrieve the term vectors are specified by an index,
type, and id, but the documents can also be artificially provided in the
request body. The response includes a docs array with all the fetched term
vectors; each element has the structure provided by the termvectors
API. Here is an example:
curl 'localhost:9200/_mtermvectors' -d '{
"docs": [
{
"_index": "testidx",
"_type": "test",
"_id": "2",
"term_statistics": true
},
{
"_index": "testidx",
"_type": "test",
"_id": "1",
"fields": [
"text"
]
}
]
}'
See the termvectors API for a description of possible parameters.
The _mtermvectors endpoint can also be used against an index (in which case it
is not required in the body):
curl 'localhost:9200/testidx/_mtermvectors' -d '{
"docs": [
{
"_type": "test",
"_id": "2",
"fields": [
"text"
],
"term_statistics": true
},
{
"_type": "test",
"_id": "1"
}
]
}'
And against an index and type:
curl 'localhost:9200/testidx/test/_mtermvectors' -d '{
"docs": [
{
"_id": "2",
"fields": [
"text"
],
"term_statistics": true
},
{
"_id": "1"
}
]
}'
If all requested documents are in the same index, share the same type, and use the same parameters, the request can be simplified:
curl 'localhost:9200/testidx/test/_mtermvectors' -d '{
"ids" : ["1", "2"],
"parameters": {
"fields": [
"text"
],
"term_statistics": true,
…
}
}'
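The relationship between the simplified form and the per-document form can be sketched in a few lines of illustrative Python (only the JSON bodies are built, no request is sent):

```python
import json

# Parameters that would otherwise be repeated in every docs entry.
shared = {"fields": ["text"], "term_statistics": True}

# Per-document form: parameters duplicated per document.
per_doc = {"docs": [dict(_id=doc_id, **shared) for doc_id in ("1", "2")]}

# Simplified form: a list of ids plus a single parameters object.
simplified = {"ids": ["1", "2"], "parameters": shared}

print(json.dumps(simplified, sort_keys=True))
```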
Additionally, just like for the termvectors
API, term vectors can be generated for user-provided documents. The syntax
is similar to the percolator API. The mapping used is
determined by _index and _type.
curl 'localhost:9200/_mtermvectors' -d '{
"docs": [
{
"_index": "testidx",
"_type": "test",
"doc" : {
"fullname" : "John Doe",
"text" : "twitter test test test"
}
},
{
"_index": "testidx",
"_type": "test",
"doc" : {
"fullname" : "Jane Doe",
"text" : "Another twitter test ..."
}
}
]
}'
Search APIs
Most search APIs are multi-index, multi-type, with the exception of the Explain API endpoints.
Routing
When executing a search, the request is broadcast to all of the index's (or
indices') shards, with round robin between replicas. Which shards will be
searched can be controlled by providing the routing parameter. For example,
when indexing tweets, the routing value can be the user name:
$ curl -XPOST 'http://localhost:9200/twitter/tweet?routing=kimchy' -d '{
"user" : "kimchy",
"postDate" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
'
In such a case, if we want to search only on the tweets for a specific user, we can specify it as the routing, resulting in the search hitting only the relevant shard:
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search?routing=kimchy' -d '{
"query": {
"bool" : {
"must" : {
"query_string" : {
"query" : "some query string here"
}
},
"filter" : {
"term" : { "user" : "kimchy" }
}
}
}
}
'
The routing parameter can be multi valued, represented as a comma separated string. This will result in hitting only the relevant shards where the routing values match.
Stats Groups
A search can be associated with stats groups, which maintain a statistics aggregation per group. The statistics can later be retrieved using the indices stats API. For example, here is a search body request that associates the request with two different groups:
{
"query" : {
"match_all" : {}
},
"stats" : ["group1", "group2"]
}
Global Search Timeout
Individual searches can have a timeout as part of the
Request Body Search. Since search requests can originate from many
sources, Elasticsearch has a dynamic cluster-level setting for a global
search timeout that applies to all search requests that do not set a
timeout in the Request Body Search. The default value is no global
timeout. The setting key is search.default_search_timeout and can be
set using the Cluster Update Settings endpoints. Setting this value
to -1 resets the global search timeout to no timeout.
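For illustration, the body for the Cluster Update Settings endpoint (PUT /_cluster/settings) could look like the following sketch; the transient scope shown is an assumption, persistent works as well:

```python
import json

# Transient settings apply until the next full cluster restart
# (assumption: a "persistent" block would survive restarts as well).
body = {"transient": {"search.default_search_timeout": "30s"}}
payload = json.dumps(body)
print(payload)

# Setting the value to -1 resets to "no global timeout".
reset_payload = json.dumps({"transient": {"search.default_search_timeout": -1}})
```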
34. Search
The search API allows you to execute a search query and get back search hits that match the query. The query can either be provided using a simple query string as a parameter, or using a request body.
Multi-Index, Multi-Type
All search APIs can be applied across multiple types within an index, and across multiple indices with support for the multi index syntax. For example, we can search on all documents across all types within the twitter index:
$ curl -XGET 'http://localhost:9200/twitter/_search?q=user:kimchy'
We can also search within specific types:
$ curl -XGET 'http://localhost:9200/twitter/tweet,user/_search?q=user:kimchy'
We can also search all tweets with a certain tag across several indices (for example, when each user has their own index):
$ curl -XGET 'http://localhost:9200/kimchy,elasticsearch/tweet/_search?q=tag:wow'
Or we can search all tweets across all available indices using _all
placeholder:
$ curl -XGET 'http://localhost:9200/_all/tweet/_search?q=tag:wow'
Or even search across all indices and all types:
$ curl -XGET 'http://localhost:9200/_search?q=tag:wow'
35. URI Search
A search request can be executed purely using a URI by providing request parameters. Not all search options are exposed when executing a search using this mode, but it can be handy for quick "curl tests". Here is an example:
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search?q=user:kimchy'
And here is a sample response:
{
"_shards":{
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits":{
"total" : 1,
"hits" : [
{
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_source" : {
"user" : "kimchy",
"postDate" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
}
]
}
}
Parameters
The parameters allowed in the URI are:
| Name | Description |
|---|---|
| `q` | The query string (maps to the `query_string` query). |
| `df` | The default field to use when no field prefix is defined within the query. |
| `analyzer` | The analyzer name to be used when analyzing the query string. |
| `lowercase_expanded_terms` | Should terms be automatically lowercased or not. Defaults to `true`. |
| `analyze_wildcard` | Should wildcard and prefix queries be analyzed or not. Defaults to `false`. |
| `default_operator` | The default operator to be used, can be `AND` or `OR`. Defaults to `OR`. |
| `lenient` | If set to true will cause format based failures (like providing text to a numeric field) to be ignored. Defaults to false. |
| `explain` | For each hit, contain an explanation of how scoring of the hits was computed. |
| `_source` | Set to `false` to disable retrieval of the `_source` field. |
| `fields` | The selective stored fields of the document to return for each hit, comma delimited. Not specifying any value will cause no fields to return. |
| `sort` | Sorting to perform. Can either be in the form of `fieldName`, or `fieldName:asc`/`fieldName:desc`. The `fieldName` can either be an actual field within the document, or the special `_score` name to indicate sorting based on scores. There can be several `sort` parameters (order is important). |
| `track_scores` | When sorting, set to `true` in order to still track scores and return them as part of each hit. |
| `timeout` | A search timeout, bounding the search request to be executed within the specified time value and bail with the hits accumulated up to that point when expired. Defaults to no timeout. |
| `terminate_after` | The maximum number of documents to collect for each shard, upon reaching which the query execution will terminate early. If set, the response will have a boolean field `terminated_early` to indicate whether the query execution has actually terminated early. Defaults to no `terminate_after`. |
| `from` | The starting from index of the hits to return. Defaults to `0`. |
| `size` | The number of hits to return. Defaults to `10`. |
| `search_type` | The type of the search operation to perform. Can be `dfs_query_then_fetch` or `query_then_fetch`. Defaults to `query_then_fetch`. |
36. Request Body Search
The search request can be executed with a search DSL, which includes the Query DSL, within its body. Here is an example:
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
"query" : {
"term" : { "user" : "kimchy" }
}
}
'
And here is a sample response:
{
"_shards":{
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits":{
"total" : 1,
"hits" : [
{
"_index" : "twitter",
"_type" : "tweet",
"_id" : "1",
"_source" : {
"user" : "kimchy",
"postDate" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
}
]
}
}
Parameters
| Name | Description |
|---|---|
| `timeout` | A search timeout, bounding the search request to be executed within the specified time value and bail with the hits accumulated up to that point when expired. Defaults to no timeout. See Time units. |
| `from` | The starting from index of the hits to return. Defaults to `0`. |
| `size` | The number of hits to return. Defaults to `10`. |
| `search_type` | The type of the search operation to perform. Can be `dfs_query_then_fetch` or `query_then_fetch`. Defaults to `query_then_fetch`. |
| `request_cache` | Set to `true` or `false` to enable or disable the caching of search results for requests where `size` is 0, i.e. aggregations and suggestions (no top hits returned). See Shard request cache. |
| `terminate_after` | The maximum number of documents to collect for each shard, upon reaching which the query execution will terminate early. If set, the response will have a boolean field `terminated_early` to indicate whether the query execution has actually terminated early. Defaults to no `terminate_after`. |
Out of the above, the search_type and the request_cache must be passed as
query-string parameters. The rest of the search request should be passed
within the body itself. The body content can also be passed as a REST
parameter named source.
Both HTTP GET and HTTP POST can be used to execute search with body. Since not all clients support GET with body, POST is allowed as well.
36.1. Query
The query element within the search request body allows to define a query using the Query DSL.
{
"query" : {
"term" : { "user" : "kimchy" }
}
}
36.2. From / Size
Pagination of results can be done by using the from and size
parameters. The from parameter defines the offset from the first
result you want to fetch. The size parameter allows you to configure
the maximum amount of hits to be returned.
Though from and size can be set as request parameters, they can also
be set within the search body. from defaults to 0, and size
defaults to 10.
{
"from" : 0, "size" : 10,
"query" : {
"term" : { "user" : "kimchy" }
}
}
Note that from + size can not be more than the index.max_result_window
index setting which defaults to 10,000. See the Scroll
API for more efficient ways to do deep scrolling.
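The from/size arithmetic and the max_result_window guard described above can be sketched as follows (the page_params helper is hypothetical, for illustration only):

```python
# Hypothetical helper: translate a zero-based page number into from/size,
# refusing requests past index.max_result_window (default 10,000).
def page_params(page, size=10, max_result_window=10000):
    start = page * size
    if start + size > max_result_window:
        raise ValueError("from + size exceeds index.max_result_window; "
                         "use the Scroll API for deep pagination")
    return {"from": start, "size": size}

print(page_params(0))    # {'from': 0, 'size': 10}
print(page_params(999))  # {'from': 9990, 'size': 10}, the last page in the window
```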
36.3. Sort
Allows you to add one or more sorts on specific fields. Each sort can be
reversed as well. The sort is defined on a per-field level, with the special
field name _score to sort by score, and _doc to sort by index order.
{
"sort" : [
{ "post_date" : {"order" : "asc"}},
"user",
{ "name" : "desc" },
{ "age" : "desc" },
"_score"
],
"query" : {
"term" : { "user" : "kimchy" }
}
}
Note: _doc has no real use-case besides being the most efficient sort order.
So if you don’t care about the order in which documents are returned, then you
should sort by _doc. This especially helps when scrolling.
36.3.1. Sort Values
The sort values for each document returned are also returned as part of the response.
36.3.2. Sort Order
The order option can have the following values:
| Option | Description |
|---|---|
| `asc` | Sort in ascending order |
| `desc` | Sort in descending order |
The order defaults to desc when sorting on the _score, and defaults
to asc when sorting on anything else.
36.3.3. Sort mode option
Elasticsearch supports sorting by array or multi-valued fields. The mode option
controls what array value is picked for sorting the document it belongs
to. The mode option can have the following values:
| Mode | Description |
|---|---|
| `min` | Pick the lowest value. |
| `max` | Pick the highest value. |
| `sum` | Use the sum of all values as sort value. Only applicable for number based array fields. |
| `avg` | Use the average of all values as sort value. Only applicable for number based array fields. |
| `median` | Use the median of all values as sort value. Only applicable for number based array fields. |
Sort mode example usage
In the example below the field price has multiple prices per document. In this case the result hits will be sorted by price ascending based on the average price per document.
curl -XPOST 'localhost:9200/_search' -d '{
"query" : {
...
},
"sort" : [
{"price" : {"order" : "asc", "mode" : "avg"}}
]
}'
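How each mode reduces a multi-valued field to a single sort value can be illustrated in a few lines (the prices are made up):

```python
import statistics

# Illustrative: how each sort mode would reduce one document's
# multi-valued "price" field to a single sort value.
prices = [10.0, 20.0, 60.0]

sort_values = {
    "min": min(prices),
    "max": max(prices),
    "sum": sum(prices),
    "avg": sum(prices) / len(prices),
    "median": statistics.median(prices),
}
print(sort_values)
# {'min': 10.0, 'max': 60.0, 'sum': 90.0, 'avg': 30.0, 'median': 20.0}
```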
36.3.4. Sorting within nested objects
Elasticsearch also supports sorting by fields that are inside one or more nested objects. Sorting by nested fields supports the following parameters on top of the already existing sort options:
nested_path - Defines on which nested object to sort. The actual sort field must be a direct field inside this nested object. When sorting by nested field, this field is mandatory.
nested_filter - A filter that the inner objects inside the nested path should match with in order for its field values to be taken into account by sorting. A common case is to repeat the query / filter inside the nested filter or query. By default no nested_filter is active.
Nested sorting example
In the below example offer is a field of type nested.
The nested_path needs to be specified; otherwise, Elasticsearch doesn’t know on what nested level sort values need to be captured.
curl -XPOST 'localhost:9200/_search' -d '{
"query" : {
...
},
"sort" : [
{
"offer.price" : {
"mode" : "avg",
"order" : "asc",
"nested_path" : "offer",
"nested_filter" : {
"term" : { "offer.color" : "blue" }
}
}
}
]
}'
Nested sorting is also supported when sorting by scripts and sorting by geo distance.
36.3.5. Missing Values
The missing parameter specifies how docs which are missing
the field should be treated. The missing value can be
set to _last, _first, or a custom value (that
will be used as the sort value for missing docs). For example:
{
"sort" : [
{ "price" : {"missing" : "_last"} }
],
"query" : {
"term" : { "user" : "kimchy" }
}
}
Note: If a nested inner object doesn’t match with
the nested_filter then a missing value is used.
36.3.6. Ignoring Unmapped Fields
By default, the search request will fail if there is no mapping
associated with a field. The unmapped_type option allows to ignore
fields that have no mapping and not sort by them. The value of this
parameter is used to determine what sort values to emit. Here is an
example of how it can be used:
{
"sort" : [
{ "price" : {"unmapped_type" : "long"} }
],
"query" : {
"term" : { "user" : "kimchy" }
}
}
If any of the queried indices doesn’t have a mapping for price,
then Elasticsearch will handle it as if there were a mapping of type
long, with all documents in this index having no value for this field.
36.3.7. Geo Distance Sorting
Allows sorting by _geo_distance. Here is an example:
{
"sort" : [
{
"_geo_distance" : {
"pin.location" : [-70, 40],
"order" : "asc",
"unit" : "km",
"mode" : "min",
"distance_type" : "sloppy_arc"
}
}
],
"query" : {
"term" : { "user" : "kimchy" }
}
}
distance_type - How to compute the distance. Can either be
sloppy_arc (default), arc (slightly more precise but significantly slower) or plane (faster, but inaccurate on long distances and close to the poles).
Note: the geo distance sorting supports sort_mode options: min,
max and avg.
The following formats are supported in providing the coordinates:
Lat Lon as Properties
{
"sort" : [
{
"_geo_distance" : {
"pin.location" : {
"lat" : 40,
"lon" : -70
},
"order" : "asc",
"unit" : "km"
}
}
],
"query" : {
"term" : { "user" : "kimchy" }
}
}
Lat Lon as String
Format in lat,lon.
{
"sort" : [
{
"_geo_distance" : {
"pin.location" : "40,-70",
"order" : "asc",
"unit" : "km"
}
}
],
"query" : {
"term" : { "user" : "kimchy" }
}
}
Geohash
{
"sort" : [
{
"_geo_distance" : {
"pin.location" : "drm3btev3e86",
"order" : "asc",
"unit" : "km"
}
}
],
"query" : {
"term" : { "user" : "kimchy" }
}
}
Lat Lon as Array
Format in [lon, lat]. Note the order of lon/lat here, in order to
conform with GeoJSON.
{
"sort" : [
{
"_geo_distance" : {
"pin.location" : [-70, 40],
"order" : "asc",
"unit" : "km"
}
}
],
"query" : {
"term" : { "user" : "kimchy" }
}
}
36.3.8. Multiple reference points
Multiple geo points can be passed as an array containing any geo_point format, for example
"pin.location" : [[-70, 40], [-71, 42]]
"pin.location" : [{"lat": 40, "lon": -70}, {"lat": 42, "lon": -71}]
and so forth.
The final distance for a document will then be min/max/avg (defined via mode) distance of all points contained in the document to all points given in the sort request.
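A sketch of that reduction for mode min, using a plain haversine great-circle distance as a stand-in for Elasticsearch's arc distance types (the coordinates reuse the examples above; distances are illustrative, not Lucene's exact values):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance on a sphere of radius 6371 km.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

doc_point = (40.0, -70.0)                    # lat, lon of the document's pin
ref_points = [(40.0, -70.0), (42.0, -71.0)]  # from [[-70, 40], [-71, 42]]

distances = [haversine_km(*doc_point, lat, lon) for lat, lon in ref_points]
print(min(distances))  # mode "min": 0.0, the document sits on the first point
```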
36.3.9. Script Based Sorting
Allows sorting based on custom scripts. Here is an example:
{
"query" : {
....
},
"sort" : {
"_script" : {
"type" : "number",
"script" : {
"inline": "doc['field_name'].value * factor",
"params" : {
"factor" : 1.1
}
},
"order" : "asc"
}
}
}
36.3.10. Track Scores
When sorting on a field, scores are not computed. By setting
track_scores to true, scores will still be computed and tracked.
{
"track_scores": true,
"sort" : [
{ "post_date" : {"reverse" : true} },
{ "name" : "desc" },
{ "age" : "desc" }
],
"query" : {
"term" : { "user" : "kimchy" }
}
}
36.3.11. Memory Considerations
When sorting, the relevant sorted field values are loaded into memory.
This means that per shard, there should be enough memory to contain
them. For string based types, the field sorted on should not be analyzed
/ tokenized. For numeric types, if possible, it is recommended to
explicitly set the type to narrower types (like short, integer and
float).
36.4. Source filtering
Allows you to control how the _source field is returned with every hit.
By default operations return the contents of the _source field unless
you have used the fields parameter or if the _source field is disabled.
To disable _source retrieval, set the _source parameter to false:
{
"_source": false,
"query" : {
"term" : { "user" : "kimchy" }
}
}
The _source also accepts one or more wildcard patterns to control what parts of the _source should be returned:
For example:
{
"_source": "obj.*",
"query" : {
"term" : { "user" : "kimchy" }
}
}
Or
{
"_source": [ "obj1.*", "obj2.*" ],
"query" : {
"term" : { "user" : "kimchy" }
}
}
Finally, for complete control, you can specify both include and exclude patterns:
{
"_source": {
"include": [ "obj1.*", "obj2.*" ],
"exclude": [ "*.description" ]
},
"query" : {
"term" : { "user" : "kimchy" }
}
}
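The include/exclude pattern matching can be approximated with shell-style wildcards over dotted field paths. This is an illustrative sketch, not Elasticsearch's implementation:

```python
import fnmatch

def flatten(obj, prefix=""):
    # Flatten nested dicts into dotted field paths, e.g. "obj1.name".
    out = {}
    for key, value in obj.items():
        path = prefix + key
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))
        else:
            out[path] = value
    return out

def source_filter(source, include, exclude):
    # Keep paths matching any include pattern, then drop excluded ones.
    flat = flatten(source)
    return {path: value for path, value in flat.items()
            if any(fnmatch.fnmatch(path, p) for p in include)
            and not any(fnmatch.fnmatch(path, p) for p in exclude)}

src = {"obj1": {"name": "foo", "description": "bar"},
       "obj2": {"name": "baz", "description": "qux"},
       "user": "kimchy"}
filtered = source_filter(src, ["obj1.*", "obj2.*"], ["*.description"])
print(filtered)  # {'obj1.name': 'foo', 'obj2.name': 'baz'}
```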
36.5. Fields
Note: The fields parameter is about fields that are explicitly marked as
stored in the mapping, which is off by default and generally not recommended.
Use source filtering instead to select
subsets of the original source document to be returned.
Allows you to selectively load specific stored fields for each document represented by a search hit.
{
"fields" : ["user", "postDate"],
"query" : {
"term" : { "user" : "kimchy" }
}
}
fields also accepts one or more wildcard patterns to control which fields of the document should be returned.
WARNING: Only stored fields can be retrieved with wildcard patterns.
For example:
{
"fields": ["xxx*", "*xxx", "*xxx*", "xxx*yyy", "user", "postDate"],
"query" : {
"term" : { "user" : "kimchy" }
}
}
* can be used to load all stored fields from the document.
An empty array will cause only the _id and _type for each hit to be
returned, for example:
{
"fields" : [],
"query" : {
"term" : { "user" : "kimchy" }
}
}
For backwards compatibility, if the fields parameter specifies fields which are not stored (store mapping set to
false), it will load the _source and extract the values from it. This functionality has been replaced by the
source filtering parameter.
Field values fetched from the document itself are always returned as an array. Metadata fields like _routing and
_parent fields are never returned as an array.
Also, only leaf fields can be returned via the fields option; object fields can’t be returned, and such requests
will fail.
Script fields can also be automatically detected and used as fields, so
things like _source.obj1.field1 can be used, though not recommended, as
obj1.field1 will work as well.
36.6. Script Fields
Allows you to return a script evaluation (based on different fields) for each hit, for example:
{
"query" : {
...
},
"script_fields" : {
"test1" : {
"script" : "doc['my_field_name'].value * 2"
},
"test2" : {
"script" : {
"inline": "doc['my_field_name'].value * factor",
"params" : {
"factor" : 2.0
}
}
}
}
}
Script fields can work on fields that are not stored (my_field_name in
the above case), and allow custom values to be returned (the
evaluated value of the script).
Script fields can also access the actual _source document indexed and
extract specific elements to be returned from it (can be an "object"
type). Here is an example:
{
"query" : {
...
},
"script_fields" : {
"test1" : {
"script" : "_source.obj1.obj2"
}
}
}
Note the _source keyword here to navigate the json-like model.
It’s important to understand the difference between
doc['my_field'].value and _source.my_field. The first, using the doc
keyword, will cause the terms for that field to be loaded to memory
(cached), which will result in faster execution, but more memory
consumption. Also, the doc[...] notation only allows for simple valued
fields (can’t return a json object from it) and make sense only on
non-analyzed or single term based fields.
The _source on the other hand causes the source to be loaded, parsed,
and then only the relevant part of the json is returned.
36.7. Field Data Fields
Allows you to return the field data representation of a field for each hit, for example:
{
"query" : {
...
},
"fielddata_fields" : ["test1", "test2"]
}
Field data fields can work on fields that are not stored.
It’s important to understand that using the fielddata_fields parameter will
cause the terms for that field to be loaded to memory (cached), which will
result in more memory consumption.
36.8. Post filter
The post_filter is applied to the search hits at the very end of a search
request, after aggregations have already been calculated. Its purpose is
best explained by example:
Imagine that you are selling shirts, and the user has specified two filters:
color:red and brand:gucci. You only want to show them red shirts made by
Gucci in the search results. Normally you would do this with a
bool query:
curl -XGET localhost:9200/shirts/_search -d '
{
"query": {
"bool": {
"filter": [
{ "term": { "color": "red" }},
{ "term": { "brand": "gucci" }}
]
}
}
}
'
However, you would also like to use faceted navigation to display a list of
other options that the user could click on. Perhaps you have a model field
that would allow the user to limit their search results to red Gucci
t-shirts or dress-shirts.
This can be done with a
terms aggregation:
curl -XGET localhost:9200/shirts/_search -d '
{
"query": {
"bool": {
"filter": [
{ "term": { "color": "red" }},
{ "term": { "brand": "gucci" }}
]
}
},
"aggs": {
"models": {
"terms": { "field": "model" }
}
}
}
'
The terms aggregation returns the most popular models of red shirts by Gucci.
But perhaps you would also like to tell the user how many Gucci shirts are
available in other colors. If you just add a terms aggregation on the
color field, you will only get back the color red, because your query
returns only red shirts by Gucci.
Instead, you want to include shirts of all colors during aggregation, then
apply the colors filter only to the search results. This is the purpose of
the post_filter:
curl -XGET localhost:9200/shirts/_search -d '
{
"query": {
"bool": {
"filter": {
"term": { "brand": "gucci" }
}
}
},
"aggs": {
"colors": {
"terms": { "field": "color" }
},
"color_red": {
"filter": {
"term": { "color": "red" }
},
"aggs": {
"models": {
"terms": { "field": "model" }
}
}
}
},
"post_filter": {
"term": { "color": "red" }
}
}
'
-
The main query now finds all shirts by Gucci, regardless of color.
-
The colors agg returns popular colors for shirts by Gucci.
-
The color_red agg limits the models sub-aggregation to red Gucci shirts.
-
Finally, the post_filter removes colors other than red from the search hits.
36.9. Highlighting
Allows you to highlight search results on one or more fields. The
implementation uses either the lucene highlighter, fast-vector-highlighter
or postings-highlighter. The following is an example of the search request
body:
{
"query" : {...},
"highlight" : {
"fields" : {
"content" : {}
}
}
}
In the above case, the content field will be highlighted for each
search hit (there will be another element in each search hit, called
highlight, which includes the highlighted fields and the highlighted
fragments).
Note: In order to perform highlighting, the actual content of the field is
required. If the field in question is stored (has store set to true in the
mapping) it will be used, otherwise, the actual _source will be loaded and
the relevant field will be extracted from it.
The field name supports wildcard notation. For example, using comment_*
will cause all fields that match the expression to be highlighted.
36.9.1. Plain highlighter
The default choice of highlighter is of type plain and uses the Lucene highlighter.
It tries hard to reflect the query matching logic in terms of understanding word importance and any word positioning criteria in phrase queries.
Note: If you want to highlight a lot of fields in a lot of documents with complex queries this highlighter will not be fast. In its efforts to accurately reflect query logic it creates a tiny in-memory index and re-runs the original query criteria through Lucene’s query execution planner to get access to low-level match information on the current document. This is repeated for every field and every document that needs highlighting. If this presents a performance issue in your system consider using an alternative highlighter.
36.9.2. Postings highlighter
If index_options is set to offsets in the mapping the postings highlighter
will be used instead of the plain highlighter. The postings highlighter:
-
Is faster since it doesn’t require reanalyzing the text to be highlighted: the larger the documents the better the performance gain should be
-
Requires less disk space than term_vectors, needed for the fast vector highlighter
-
Breaks the text into sentences and highlights them. Plays really well with natural languages, not as well with fields containing for instance html markup
-
Treats the document as the whole corpus, and scores individual sentences as if they were documents in this corpus, using the BM25 algorithm
Here is an example of setting the content field to allow for
highlighting using the postings highlighter on it:
{
"type_name" : {
"content" : {"index_options" : "offsets"}
}
}
Note: the postings highlighter is meant to perform simple query terms highlighting, regardless of their positions. That means that when used for instance in combination with a phrase query, it will highlight all the terms that the query is composed of, regardless of whether they are actually part of a query match, effectively ignoring their positions.
Note: the postings highlighter doesn’t support highlighting some complex queries,
like a match query with type set to match_phrase_prefix. No highlighted
snippets will be returned in that case.
36.9.3. Fast vector highlighter
If term_vector information is provided by setting term_vector to
with_positions_offsets in the mapping then the fast vector highlighter
will be used instead of the plain highlighter. The fast vector highlighter:
-
Is faster especially for large fields (> 1MB)
Can be customized with boundary_chars, boundary_max_scan, and fragment_offset (see below)
Requires setting term_vector to with_positions_offsets which increases the size of the index
Can combine matches from multiple fields into one result. See
matched_fields -
Can assign different weights to matches at different positions allowing for things like phrase matches being sorted above term matches when highlighting a Boosting Query that boosts phrase matches over term matches
Here is an example of setting the content field to allow for
highlighting using the fast vector highlighter on it (this will cause
the index to be bigger):
{
"type_name" : {
"content" : {"term_vector" : "with_positions_offsets"}
}
}
36.9.4. Force highlighter type
The type field allows you to force a specific highlighter type. This is useful
for instance when needing to use the plain highlighter on a field that has
term_vectors enabled. The allowed values are: plain, postings and fvh.
The following is an example that forces the use of the plain highlighter:
{
"query" : {...},
"highlight" : {
"fields" : {
"content" : {"type" : "plain"}
}
}
}
36.9.5. Force highlighting on source
Forces the highlighting to highlight fields based on the source even if fields are
stored separately. Defaults to false.
{
"query" : {...},
"highlight" : {
"fields" : {
"content" : {"force_source" : true}
}
}
}
36.9.6. Highlighting Tags
By default, the highlighting will wrap highlighted text in <em> and
</em>. This can be controlled by setting pre_tags and post_tags,
for example:
{
"query" : {...},
"highlight" : {
"pre_tags" : ["<tag1>"],
"post_tags" : ["</tag1>"],
"fields" : {
"_all" : {}
}
}
}
When using the fast vector highlighter, more tags can be specified, ordered by "importance".
{
"query" : {...},
"highlight" : {
"pre_tags" : ["<tag1>", "<tag2>"],
"post_tags" : ["</tag1>", "</tag2>"],
"fields" : {
"_all" : {}
}
}
}
There are also built in "tag" schemas, with currently a single schema
called styled with the following pre_tags:
<em class="hlt1">, <em class="hlt2">, <em class="hlt3">,
<em class="hlt4">, <em class="hlt5">, <em class="hlt6">,
<em class="hlt7">, <em class="hlt8">, <em class="hlt9">,
<em class="hlt10">
and </em> as post_tags. If you think of other nice-to-have built-in tag
schemas, just send an email to the mailing list or open an issue. Here
is an example of switching tag schemas:
{
"query" : {...},
"highlight" : {
"tags_schema" : "styled",
"fields" : {
"content" : {}
}
}
}
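Since the styled schema's tags follow a simple numbering, they can be generated rather than typed out (illustrative):

```python
# pre_tags of the built-in "styled" schema:
# <em class="hlt1"> through <em class="hlt10">, closed by </em>.
pre_tags = ['<em class="hlt{}">'.format(i) for i in range(1, 11)]
post_tag = "</em>"

print(pre_tags[0], pre_tags[-1])  # <em class="hlt1"> <em class="hlt10">
```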
36.9.7. Encoder
An encoder parameter can be used to define how highlighted text will
be encoded. It can be either default (no encoding) or html (will
escape html, if you use html highlighting tags).
36.9.8. Highlighted Fragments
Each field highlighted can control the size of the highlighted fragment
in characters (defaults to 100), and the maximum number of fragments
to return (defaults to 5).
For example:
{
"query" : {...},
"highlight" : {
"fields" : {
"content" : {"fragment_size" : 150, "number_of_fragments" : 3}
}
}
}
The fragment_size is ignored when using the postings highlighter, as it
outputs sentences regardless of their length.
On top of this it is possible to specify that highlighted fragments need to be sorted by score:
{
"query" : {...},
"highlight" : {
"order" : "score",
"fields" : {
"content" : {"fragment_size" : 150, "number_of_fragments" : 3}
}
}
}
If the number_of_fragments value is set to 0 then no fragments are
produced, instead the whole content of the field is returned, and of
course it is highlighted. This can be very handy if short texts (like
document title or address) need to be highlighted but no fragmentation
is required. Note that fragment_size is ignored in this case.
{
"query" : {...},
"highlight" : {
"fields" : {
"_all" : {},
"bio.title" : {"number_of_fragments" : 0}
}
}
}
When using the fast-vector-highlighter, one can use the fragment_offset
parameter to control the margin from which to start highlighting.
In the case where there is no matching fragment to highlight, the default is
to not return anything. Instead, we can return a snippet of text from the
beginning of the field by setting no_match_size (default 0) to the length
of the text that you want returned. The actual length may be shorter than
specified as it tries to break on a word boundary. When using the postings
highlighter it is not possible to control the actual size of the snippet,
therefore the first sentence gets returned whenever no_match_size is
greater than 0.
{
"query" : {...},
"highlight" : {
"fields" : {
"content" : {
"fragment_size" : 150,
"number_of_fragments" : 3,
"no_match_size": 150
}
}
}
}
36.9.9. Highlight query
It is also possible to highlight against a query other than the search
query by setting highlight_query. This is especially useful if you
use a rescore query because those are not taken into account by
highlighting by default. Elasticsearch does not validate that
highlight_query contains the search query in any way so it is possible
to define it so legitimate query results aren’t highlighted at all.
Generally it is better to include the search query in the
highlight_query. Here is an example of including both the search
query and the rescore query in highlight_query.
{
"fields": [ "_id" ],
"query" : {
"match": {
"content": {
"query": "foo bar"
}
}
},
"rescore": {
"window_size": 50,
"query": {
"rescore_query" : {
"match_phrase": {
"content": {
"query": "foo bar",
"phrase_slop": 1
}
}
},
"rescore_query_weight" : 10
}
},
"highlight" : {
"order" : "score",
"fields" : {
"content" : {
"fragment_size" : 150,
"number_of_fragments" : 3,
"highlight_query": {
"bool": {
"must": {
"match": {
"content": {
"query": "foo bar"
}
}
},
"should": {
"match_phrase": {
"content": {
"query": "foo bar",
"phrase_slop": 1,
"boost": 10.0
}
}
},
"minimum_should_match": 0
}
}
}
}
}
}
Note that the score of text fragment in this case is calculated by the Lucene
highlighting framework. For implementation details you can check the
ScoreOrderFragmentsBuilder.java class. On the other hand when using the
postings highlighter the fragments are scored using, as mentioned above,
the BM25 algorithm.
36.9.10. Global Settings
Highlighting settings can be set on a global level and then overridden at the field level.
{
"query" : {...},
"highlight" : {
"number_of_fragments" : 3,
"fragment_size" : 150,
"tag_schema" : "styled",
"fields" : {
"_all" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] },
"bio.title" : { "number_of_fragments" : 0 },
"bio.author" : { "number_of_fragments" : 0 },
"bio.content" : { "number_of_fragments" : 5, "order" : "score" }
}
}
}
36.9.11. Require Field Match
require_field_match can be set to false which will cause any field to
be highlighted regardless of whether the query matched specifically on them.
The default behaviour is true, meaning that only fields that hold a query
match will be highlighted.
{
"query" : {...},
"highlight" : {
"require_field_match": false
"fields" : {...}
}
}
36.9.12. Boundary Characters
When highlighting a field using the fast vector highlighter,
boundary_chars can be configured to define what constitutes a boundary
for highlighting. It’s a single string with each boundary character
defined in it. It defaults to .,!? \t\n.
The boundary_max_scan parameter allows you to control how far to look for boundary
characters, and defaults to 20.
36.9.13. Matched Fields
The Fast Vector Highlighter can combine matches on multiple fields to
highlight a single field using matched_fields. This is most
intuitive for multifields that analyze the same string in different
ways. All matched_fields must have term_vector set to
with_positions_offsets but only the field to which the matches are
combined is loaded so only that field would benefit from having
store set to yes.
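A mapping that satisfies these requirements might look like the sketch below (the type name my_type is illustrative; content and content.plain match the examples that follow):

```json
{
  "mappings": {
    "my_type": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "english",
          "term_vector": "with_positions_offsets",
          "store": true,
          "fields": {
            "plain": {
              "type": "string",
              "analyzer": "standard",
              "term_vector": "with_positions_offsets"
            }
          }
        }
      }
    }
  }
}
```

Both fields carry term_vector set to with_positions_offsets, but only content, the field to which the matches are combined, is stored.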
In the following examples content is analyzed by the english
analyzer and content.plain is analyzed by the standard analyzer.
{
"query": {
"query_string": {
"query": "content.plain:running scissors",
"fields": ["content"]
}
},
"highlight": {
"order": "score",
"fields": {
"content": {
"matched_fields": ["content", "content.plain"],
"type" : "fvh"
}
}
}
}
The above matches both "run with scissors" and "running with scissors" and would highlight "running" and "scissors" but not "run". If both phrases appear in a large document then "running with scissors" is sorted above "run with scissors" in the fragments list because there are more matches in that fragment.
{
"query": {
"query_string": {
"query": "running scissors",
"fields": ["content", "content.plain^10"]
}
},
"highlight": {
"order": "score",
"fields": {
"content": {
"matched_fields": ["content", "content.plain"],
"type" : "fvh"
}
}
}
}
The above highlights "run" as well as "running" and "scissors" but still sorts "running with scissors" above "run with scissors" because the plain match ("running") is boosted.
{
"query": {
"query_string": {
"query": "running scissors",
"fields": ["content", "content.plain^10"]
}
},
"highlight": {
"order": "score",
"fields": {
"content": {
"matched_fields": ["content.plain"],
"type" : "fvh"
}
}
}
}
The above query wouldn’t highlight "run" or "scissor" but shows that
it is just fine not to list the field to which the matches are combined
(content) in the matched fields.
Note: Technically it is also fine to add fields to matched_fields that
don’t share the same underlying string as the field to which the matches
are combined. The results might not make much sense and if one of the
matches is off the end of the text then the whole query will fail.
Note: There is a small amount of overhead involved with setting
matched_fields to a non-empty array, so always prefer
"content": {}
to
"content": { "matched_fields": ["content"] }
36.9.14. Phrase Limit
The fast-vector-highlighter has a phrase_limit parameter that prevents
it from analyzing too many phrases and eating tons of memory. It defaults
to 256, so only the first 256 matching phrases in the document are
considered. You can raise the limit with the phrase_limit parameter, but
keep in mind that scoring more phrases consumes more time and memory.
If using matched_fields keep in mind that phrase_limit phrases per
matched field are considered.
Field Highlight Order
Elasticsearch highlights the fields in the order that they are sent. Per the
JSON spec, objects are unordered, but if you need to be explicit about the order
in which fields are highlighted you can use an array for fields like this:
"highlight": {
"fields": [
{"title":{ /*params*/ }},
{"text":{ /*params*/ }}
]
}
None of the highlighters built into Elasticsearch care about the order that the fields are highlighted but a plugin may.
36.10. Rescoring
Rescoring can help to improve precision by reordering just the top (e.g.
100-500) documents returned by the
query and
post_filter phases, using a
secondary (usually more costly) algorithm, instead of applying the
costly algorithm to all documents in the index.
A rescore request is executed on each shard before it returns its
results to be sorted by the node handling the overall search request.
Currently the rescore API has only one implementation: the query rescorer, which uses a query to tweak the scoring. In the future, alternative rescorers may be made available, for example, a pair-wise rescorer.
Note: the rescore phase is not executed when search_type is set to scan or count.
Note: when exposing pagination to your users, you should not change
window_size as you step through each page (by passing different
from values) since that can alter the top hits, causing results to
confusingly shift as the user steps through pages.
36.10.1. Query rescorer
The query rescorer executes a second query only on the Top-K results
returned by the query and
post_filter phases. The
number of docs which will be examined on each shard can be controlled by
the window_size parameter, which defaults to
from and size.
By default the scores from the original query and the rescore query are
combined linearly to produce the final _score for each document. The
relative importance of the original query and of the rescore query can
be controlled with the query_weight and rescore_query_weight
respectively. Both default to 1.
For example:
curl -s -XPOST 'localhost:9200/_search' -d '{
"query" : {
"match" : {
"field1" : {
"operator" : "or",
"query" : "the quick brown",
"type" : "boolean"
}
}
},
"rescore" : {
"window_size" : 50,
"query" : {
"rescore_query" : {
"match" : {
"field1" : {
"query" : "the quick brown",
"type" : "phrase",
"slop" : 2
}
}
},
"query_weight" : 0.7,
"rescore_query_weight" : 1.2
}
}
}
'
The way the scores are combined can be controlled with the score_mode:
| Score Mode | Description |
|---|---|
| total | Add the original score and the rescore query score. The default. |
| multiply | Multiply the original score by the rescore query score. Useful for function_score query rescores. |
| avg | Average the original score and the rescore query score. |
| max | Take the max of the original score and the rescore query score. |
| min | Take the min of the original score and the rescore query score. |
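The weighting and score_mode arithmetic can be sketched in plain Python (this is an illustration of the combination logic, not Elasticsearch code):

```python
# Sketch of how an original query score and a rescore query score are
# combined. The weights scale each score before score_mode merges them;
# "total" (addition) is the default.

def combine_scores(original, rescore, query_weight=1.0,
                   rescore_query_weight=1.0, score_mode="total"):
    a = original * query_weight
    b = rescore * rescore_query_weight
    if score_mode == "total":
        return a + b
    if score_mode == "multiply":
        return a * b
    if score_mode == "avg":
        return (a + b) / 2.0
    if score_mode == "max":
        return max(a, b)
    if score_mode == "min":
        return min(a, b)
    raise ValueError("unknown score_mode: %s" % score_mode)

# With the weights from the example above (0.7 and 1.2):
final = combine_scores(2.0, 3.0, query_weight=0.7, rescore_query_weight=1.2)
```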
36.10.2. Multiple Rescores
It is also possible to execute multiple rescores in sequence:
curl -s -XPOST 'localhost:9200/_search' -d '{
"query" : {
"match" : {
"field1" : {
"operator" : "or",
"query" : "the quick brown",
"type" : "boolean"
}
}
},
"rescore" : [ {
"window_size" : 100,
"query" : {
"rescore_query" : {
"match" : {
"field1" : {
"query" : "the quick brown",
"type" : "phrase",
"slop" : 2
}
}
},
"query_weight" : 0.7,
"rescore_query_weight" : 1.2
}
}, {
"window_size" : 10,
"query" : {
"score_mode": "multiply",
"rescore_query" : {
"function_score" : {
"script_score": {
"script": "log10(doc['numeric'].value + 2)"
}
}
}
}
} ]
}
'
The first one gets the results of the query then the second one gets the results of the first, etc. The second rescore will "see" the sorting done by the first rescore so it is possible to use a large window on the first rescore to pull documents into a smaller window for the second rescore.
36.11. Search Type
There are different execution paths that can be done when executing a distributed search. The distributed search operation needs to be scattered to all the relevant shards and then all the results are gathered back. When doing scatter/gather type execution, there are several ways to do that, specifically with search engines.
One of the questions when executing a distributed search is how many results to retrieve from each shard. For example, if we have 10 shards, the 1st shard might hold the most relevant results from 0 till 10, with other shards results ranking below it. For this reason, when executing a request, we will need to get results from 0 till 10 from all shards, sort them, and then return the results if we want to ensure correct results.
Another question, which relates to the search engine, is the fact that each shard stands on its own. When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first gather the term frequencies from all shards to calculate global term frequencies, then execute the query on each shard using these global frequencies.
Also, because of the need to sort the results, getting back a large
document set, or even scrolling it, while maintaining the correct sorting
behavior can be a very expensive operation. For large result set
scrolling, it is best to sort by _doc if the order in which documents
are returned is not important.
Elasticsearch is very flexible and allows you to control the type of search to execute on a per search request basis. The type can be configured by setting the search_type parameter in the query string. The types are:
36.11.1. Query Then Fetch
Parameter value: query_then_fetch.
The request is processed in two phases. In the first phase, the query
is forwarded to all involved shards. Each shard executes the search
and generates a sorted list of results, local to that shard. Each
shard returns just enough information to the coordinating node
to allow it to merge and re-sort the shard level results into a globally
sorted set of results, of maximum length size.
During the second phase, the coordinating node requests the document content (and highlighted snippets, if any) from only the relevant shards.
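The two phases can be sketched as follows; FakeShard and the tiny driver below are stand-ins for real shards and the coordinating node:

```python
import heapq

class FakeShard:
    """Stand-in for a shard: holds (score, doc_id, source) tuples."""
    def __init__(self, docs):
        self.docs = docs

    def search(self, size):
        # Phase 1: return only (score, doc_id) for the local top `size` hits.
        return heapq.nlargest(size, [(s, d) for s, d, _ in self.docs])

    def fetch(self, doc_id):
        # Phase 2: return the stored document content.
        return next(src for _, d, src in self.docs if d == doc_id)

def query_then_fetch(shards, size):
    # The coordinating node merges per-shard sorted lists into a global top `size`.
    per_shard = [(hit, shard) for shard in shards for hit in shard.search(size)]
    winners = heapq.nlargest(size, per_shard, key=lambda x: x[0])
    # Only the shards holding winning docs are asked for document content.
    return [(score, doc_id, shard.fetch(doc_id))
            for (score, doc_id), shard in winners]

shards = [FakeShard([(3.2, "a", "doc a"), (1.1, "b", "doc b")]),
          FakeShard([(2.5, "c", "doc c"), (0.4, "d", "doc d")])]
top = query_then_fetch(shards, size=2)
# Global top 2 across both shards: "a" (3.2) then "c" (2.5)
```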
Note: This is the default setting, if you do not specify a search_type in your request.
36.11.2. Dfs, Query Then Fetch
Parameter value: dfs_query_then_fetch.
Same as "Query Then Fetch", except for an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.
36.11.3. Count
Deprecated in 2.0.0-beta1: count does not provide any benefits over query_then_fetch with a size of 0.
Parameter value: count.
A special search type that returns the count that matched the search
request without any docs (represented in total_hits), and possibly,
including aggregations as well. In general, this is preferable to the count
API as it provides more options.
36.12. Scroll
While a search request returns a single “page” of results, the scroll
API can be used to retrieve large numbers of results (or even all results)
from a single search request, in much the same way as you would use a cursor
on a traditional database.
Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.
|
|
The results that are returned from a scroll request reflect the state of
the index at the time that the initial search request was made, like a
snapshot in time. Subsequent changes to documents (index, update or delete)
will only affect later search requests.
|
In order to use scrolling, the initial search request should specify the
scroll parameter in the query string, which tells Elasticsearch how long it
should keep the “search context” alive (see Keeping the search context alive), e.g. ?scroll=1m.
curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '
{
"query": {
"match" : {
"title" : "elasticsearch"
}
}
}
'
The result from the above request includes a _scroll_id, which should
be passed to the scroll API in order to retrieve the next batch of
results.
curl -XGET <1> 'localhost:9200/_search/scroll' <2> -d'
{
"scroll" : "1m", <3>
"scroll_id" : "c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1" <4>
}
'
<1> GET or POST can be used.
<2> The URL should not include the index or type name; these are specified in the original search request instead.
<3> The scroll parameter tells Elasticsearch to keep the search context open for another 1m.
<4> The scroll_id parameter.
Each call to the scroll API returns the next batch of results until there
are no more results left to return, i.e. the hits array is empty.
For backwards compatibility, scroll_id and scroll can be passed in the query
string, and the scroll_id can be passed in the request body:
curl -XGET 'localhost:9200/_search/scroll?scroll=1m' -d 'c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1'
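The resulting client-side loop can be sketched like this (fake_execute is a stand-in for a real HTTP call and returns canned pages):

```python
# Client-side scroll loop: always pass the most recent _scroll_id and stop
# once the hits array comes back empty.

def scroll_all(execute, initial_request, scroll="1m"):
    response = execute("/_search?scroll=" + scroll, initial_request)
    all_hits = []
    while response["hits"]["hits"]:
        all_hits.extend(response["hits"]["hits"])
        response = execute("/_search/scroll",
                           {"scroll": scroll,
                            "scroll_id": response["_scroll_id"]})
    return all_hits

# Stub transport returning two pages of hits, then an empty page.
pages = [{"_scroll_id": "s1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
         {"_scroll_id": "s2", "hits": {"hits": [{"_id": "3"}]}},
         {"_scroll_id": "s3", "hits": {"hits": []}}]
def fake_execute(path, body):
    return pages.pop(0)

hits = scroll_all(fake_execute, {"query": {"match_all": {}}})
```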
Note: The initial search request and each subsequent scroll request
returns a new _scroll_id; only the most recent _scroll_id should be
used.
Note: If the request specifies aggregations, only the initial search response will contain the aggregations results.
Note: Scroll requests have optimizations that make them faster when the sort
order is _doc. If you want to iterate over all documents regardless of the
order, this is the most efficient option:
curl -XGET 'localhost:9200/_search?scroll=1m' -d '
{
"sort": [
"_doc"
]
}
'
36.12.1. Keeping the search context alive
The scroll parameter (passed to the search request and to every scroll
request) tells Elasticsearch how long it should keep the search context alive.
Its value (e.g. 1m, see Time units) does not need to be long enough to
process all data — it just needs to be long enough to process the previous
batch of results. Each scroll request (with the scroll parameter) sets a
new expiry time.
Normally, the background merge process optimizes the index by merging together smaller segments to create new bigger segments, at which time the smaller segments are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted while they are still in use. This is how Elasticsearch is able to return the results of the initial search request, regardless of subsequent changes to documents.
Note: Keeping older segments alive means that more file handles are needed. Ensure that you have configured your nodes to have ample free file handles. See File Descriptors.
You can check how many search contexts are open with the nodes stats API:
curl -XGET localhost:9200/_nodes/stats/indices/search?pretty
36.12.2. Clear scroll API
Search contexts are automatically removed when the scroll timeout has been
exceeded. However, keeping scrolls open has a cost (as discussed in the
previous section), so scrolls should be explicitly
cleared as soon as they are no longer needed, using the
clear-scroll API:
curl -XDELETE localhost:9200/_search/scroll -d '
{
"scroll_id" : ["c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1"]
}'
Multiple scroll IDs can be passed as an array:
curl -XDELETE localhost:9200/_search/scroll -d '
{
"scroll_id" : ["c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1", "aGVuRmV0Y2g7NTsxOnkxaDZ"]
}'
All search contexts can be cleared with the _all parameter:
curl -XDELETE localhost:9200/_search/scroll/_all
The scroll_id can also be passed as a query string parameter or in the request body.
Multiple scroll IDs can be passed as comma separated values:
curl -XDELETE localhost:9200/_search/scroll \
-d 'c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1,aGVuRmV0Y2g7NTsxOnkxaDZ'
36.13. Preference
Controls a preference of which shard replicas to execute the search
request on. By default, the operation is randomized between the shard
replicas.
The preference is a query string parameter which can be set to:
_primary
    The operation will be executed only on primary shards.
_primary_first
    The operation will be executed on primary shards, and if not available (failover), will execute on other shards.
_replica
    The operation will be executed only on a replica shard.
_replica_first
    The operation will be executed only on a replica shard, and if not available (failover), will execute on other shards.
_local
    The operation will prefer to be executed on a locally allocated shard if possible.
_only_node:xyz
    Restricts the search to execute only on a node with the provided node id (xyz in this case).
_prefer_node:xyz
    Prefers execution on the node with the provided node id (xyz in this case) if applicable.
_shards:2,3
    Restricts the operation to the specified shards (2 and 3 in this case).
_only_nodes
    Restricts the operation to nodes specified in the node specification: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster.html
Custom (string) value
    A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with "jumping values" when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.
For instance, use the user’s session ID to ensure consistent ordering of results for the user:
curl localhost:9200/_search?preference=xyzabc123 -d '
{
"query": {
"match": {
"title": "elasticsearch"
}
}
}
'
36.14. Explain
Enables explanation for each hit on how its score was computed.
{
"explain": true,
"query" : {
"term" : { "user" : "kimchy" }
}
}
36.15. Version
Returns a version for each search hit.
{
"version": true,
"query" : {
"term" : { "user" : "kimchy" }
}
}
36.16. Index Boost
Allows you to configure different boost levels per index when searching across more than one index. This is very handy when hits coming from one index matter more than hits coming from another index (think social graph where each user has an index).
{
"indices_boost" : {
"index1" : 1.4,
"index2" : 1.3
}
}
36.17. min_score
Exclude documents which have a _score less than the minimum specified
in min_score:
{
"min_score": 0.5,
"query" : {
"term" : { "user" : "kimchy" }
}
}
Note that, most of the time, this does not make much sense, but it is provided for advanced use cases.
36.18. Named Queries
Each filter and query can accept a _name in its top level definition.
{
"bool" : {
"should" : [
{"match" : { "name.first" : {"query" : "shay", "_name" : "first"} }},
{"match" : { "name.last" : {"query" : "banon", "_name" : "last"} }}
],
"filter" : {
"terms" : {
"name.last" : ["banon", "kimchy"],
"_name" : "test"
}
}
}
}
The search response will include for each hit the matched_queries it matched on. The tagging of queries and filters
only makes sense for the bool query.
36.19. Inner hits
The parent/child and nested features allow the return of documents that have matches in a different scope. In the parent/child case, parent documents are returned based on matches in child documents, or child documents are returned based on matches in parent documents. In the nested case, documents are returned based on matches in nested inner objects.
In both cases, the actual matches in the different scopes that caused a document to be returned are hidden. In many cases, it’s very useful to know which inner nested objects (in the case of nested) or children/parent documents (in the case of parent/child) caused certain information to be returned. The inner hits feature can be used for this. This feature returns per search hit in the search response additional nested hits that caused a search hit to match in a different scope.
Inner hits can be used by defining an inner_hits definition on a nested, has_child or has_parent query and filter.
The structure looks like this:
"<query>" : {
"inner_hits" : {
<inner_hits_options>
}
}
If inner_hits is defined on a query that supports it then each search hit will contain an inner_hits json object with the following structure:
"hits": [
{
"_index": ...,
"_type": ...,
"_id": ...,
"inner_hits": {
"<inner_hits_name>": {
"hits": {
"total": ...,
"hits": [
{
"_type": ...,
"_id": ...,
...
},
...
]
}
}
},
...
},
...
]
36.19.1. Options
Inner hits support the following options:
from
    The offset from where the first hit to fetch for each inner_hits in the returned regular search hits.
size
    The maximum number of hits to return per inner_hits. By default the top three matching hits are returned.
sort
    How the inner hits should be sorted per inner_hits. By default the hits are sorted by the score.
name
    The name to be used for the particular inner hit definition in the response. Useful when multiple inner hits
    have been defined in a single search request. The default depends in which query the inner hit is defined.
    For the has_child query and filter this is the child type, for the has_parent query and filter this is the parent type
    and for the nested query and filter this is the nested path.
Inner hits also support the following per document features:
Highlighting
Explain
Source filtering
Script fields
Fielddata fields
Include versions
36.19.2. Nested inner hits
The nested inner_hits can be used to include nested inner objects as inner hits to a search hit.
The example below assumes that there is a nested object field defined with the name comments:
{
"query" : {
"nested" : {
"path" : "comments",
"query" : {
"match" : {"comments.message" : "[actual query]"}
},
"inner_hits" : {}
}
}
}
The inner hit definition in the nested query. No other options need to be defined.
An example of a response snippet that could be generated from the above search request:
...
"hits": {
...
"hits": [
{
"_index": "my-index",
"_type": "question",
"_id": "1",
"_source": ...,
"inner_hits": {
"comments": {
"hits": {
"total": ...,
"hits": [
{
"_type": "question",
"_id": "1",
"_nested": {
"field": "comments",
"offset": 2
},
"_source": ...
},
...
]
}
}
}
},
...
The name used in the inner hit definition in the search request. A custom key can be used via the name option.
The _nested metadata is crucial in the above example, because it defines from what inner nested object this inner hit
came from. The field defines the object array field the nested hit is from, and the offset is relative to its location
in the _source. Due to sorting and scoring, the actual location of the hit objects in inner_hits is usually
different from the location where the nested inner object was defined.
By default the _source is also returned for the hit objects in inner_hits, but this can be changed. Via the
_source filtering feature, part of the source can be returned, or it can be disabled altogether. If stored fields are defined on the
nested level these can also be returned via the fields feature.
An important default is that the _source returned in hits inside inner_hits is relative to the _nested metadata.
So in the above example only the comment part is returned per nested hit and not the entire source of the top level
document that contained the comment.
Note: A bug in Elasticsearch 2.x means that if you explicitly specify fields to be returned as part of the _source for inner_hits, you need to define them using the relative path, so in the example above you must write:
"inner_hits" : {
"_source":["message"]
}
If you return field data using fielddata_fields, you need to specify the full path instead.
36.19.3. Hierarchical levels of nested object fields and inner hits.
If a mapping has multiple levels of hierarchical nested object fields each level can be accessed using Top level inner hits (see below).
36.19.4. Parent/child inner hits
The parent/child inner_hits can be used to include parent or child documents as inner hits to a search hit.
The example below assumes that there is a _parent field mapping in the comment type:
{
"query" : {
"has_child" : {
"type" : "comment",
"query" : {
"match" : {"message" : "[actual query]"}
},
"inner_hits" : {}
}
}
}
The inner hit definition, like in the nested example.
An example of a response snippet that could be generated from the above search request:
...
"hits": {
...
"hits": [
{
"_index": "my-index",
"_type": "question",
"_id": "1",
"_source": ...,
"inner_hits": {
"comment": {
"hits": {
"total": ...,
"hits": [
{
"_type": "comment",
"_id": "5",
"_source": ...
},
...
]
}
}
}
},
...
36.19.5. Top level inner hits
Besides defining inner hits on queries and filters, inner hits can also be defined as a top level construct alongside the
query and aggregations definition. The main reason for using the top level inner hits definition is to let the
inner hits return documents that don’t match with the main query. Also, inner hits definitions can be nested via the
top level notation. Other than that, the inner hit definition inside the query should be preferred, because that is the most
compact way of defining inner hits.
The following snippet explains the basic structure of inner hits defined at the top level of the search request body:
"inner_hits" : {
"<inner_hits_name>" : {
"<path|type>" : {
"<path-to-nested-object-field|child-or-parent-type>" : {
<inner_hits_body>
[,"inner_hits" : { [<sub_inner_hits>]+ } ]?
}
}
}
[,"<inner_hits_name_2>" : { ... } ]*
}
Inside the inner_hits definition, first the name of the inner hit is defined, then whether the inner_hit
is nested (by defining path) or parent/child based (by defining type). The next object layer contains
the name of the nested object field if the inner_hits is nested, or the parent or child type if the inner_hit definition
is parent/child based.
Multiple inner hit definitions can be defined in a single request. In the <inner_hits_body> any option for features
that inner_hits support can be defined. Optionally another inner_hits definition can be defined in the <inner_hits_body>.
An example that shows the use of nested inner hits via the top level notation:
{
"query" : {
"nested" : {
"path" : "comments",
"query" : {
"match" : {"comments.message" : "[actual query]"}
}
}
},
"inner_hits" : {
"comment" : {
"path" : {
"comments" : {
"query" : {
"match" : {"comments.message" : "[different query]"}
}
}
}
}
}
}
The inner hit definition is nested and requires the path option.
The path option refers to the nested object field comments.
A query that runs to collect the nested inner documents for each search hit returned. If no query is defined, all nested inner documents belonging to a search hit will be included. This shows that the top level inner hit definition only makes sense when a query different from the main query (or no query at all) is specified.
Additional options that are only available when using the top level inner hits notation:
path
    Defines the nested scope where hits will be collected from.
type
    Defines the parent or child type scope where hits will be collected from.
query
    Defines the query that will run in the defined nested, parent or child scope to collect and score hits. By default all documents in the scope will be matched.
Either path or type must be defined. The path or type defines the scope from where hits are fetched and
used as inner hits.
37. Search Template
The /_search/template endpoint allows you to use the mustache language to pre-render search requests,
filling existing templates with template parameters before they are executed.
GET /_search/template
{
"inline" : {
"query": { "match" : { "{{my_field}}" : "{{my_value}}" } },
"size" : "{{my_size}}"
},
"params" : {
"my_field" : "foo",
"my_value" : "bar",
"my_size" : 5
}
}
For more information on Mustache templating and what kind of templating you can do with it, check out the online documentation of the mustache project.
Note: The mustache language is implemented in Elasticsearch as a sandboxed scripting language, hence it obeys settings that may be used to enable or disable scripts per language, source and operation, as described in the scripting docs.
More template examples
Filling in a query string with a single value
GET /_search/template
{
"inline": {
"query": {
"match": {
"title": "{{query_string}}"
}
}
},
"params": {
"query_string": "search for these words"
}
}
Passing an array of strings
GET /_search/template
{
"inline": {
"query": {
"terms": {
"status": [
"{{#status}}",
"{{.}}",
"{{/status}}"
]
}
}
},
"params": {
"status": [ "pending", "published" ]
}
}
which is rendered as:
{
"query": {
"terms": {
"status": [ "pending", "published" ]
}
}
}
Default values
A default value is written as {{var}}{{^var}}default{{/var}} for instance:
{
"inline": {
"query": {
"range": {
"line_no": {
"gte": "{{start}}",
"lte": "{{end}}{{^end}}20{{/end}}"
}
}
}
},
"params": { ... }
}
When params is { "start": 10, "end": 15 } this query would be rendered as:
{
"range": {
"line_no": {
"gte": "10",
"lte": "15"
}
}
}
But when params is { "start": 10 } this query would use the default value
for end:
{
"range": {
"line_no": {
"gte": "10",
"lte": "20"
}
}
}
Conditional clauses
Conditional clauses cannot be expressed using the JSON form of the template.
Instead, the template must be passed as a string. For instance, let’s say
we wanted to run a match query on the line field, and optionally wanted
to filter by line numbers, where start and end are optional.
The params would look like:
{
"params": {
"text": "words to search for",
"line_no": {
"start": 10,
"end": 20
}
}
}
All three of these elements are optional.
We could write the query as:
{
"query": {
"bool": {
"must": {
"match": {
"line": "{{text}}"
}
},
"filter": {
{{#line_no}}
"range": {
"line_no": {
{{#start}}
"gte": "{{start}}"
{{#end}},{{/end}}
{{/start}}
{{#end}}
"lte": "{{end}}"
{{/end}}
}
}
{{/line_no}}
}
}
}
}
Fill in the value of param text.
Include the range filter only if line_no is specified.
Include the gte clause only if line_no.start is specified.
Fill in the value of param line_no.start.
Add a comma after the gte clause only if line_no.start AND line_no.end are specified.
Include the lte clause only if line_no.end is specified.
Fill in the value of param line_no.end.
Note: As written above, this template is not valid JSON because it includes the
section markers like {{#line_no}}. For this reason, the template should either
be stored in a file (see Pre-registered template) or, when used via the REST
API, should be written as a string.
Pre-registered template
You can register search templates by storing them in the config/scripts directory, in a file using the .mustache extension.
In order to execute the stored template, reference it by its name under the template key:
GET /_search/template
{
"file": "storedTemplate",
"params": {
"query_string": "search for these words"
}
}
Name of the query template in config/scripts/, i.e., storedTemplate.mustache.
You can also register search templates by storing them in the Elasticsearch cluster, in a special index named .scripts.
There are REST APIs to manage these indexed templates.
POST /_search/template/<templatename>
{
"template": {
"query": {
"match": {
"title": "{{query_string}}"
}
}
}
}
This template can be retrieved by
GET /_search/template/<templatename>
which is rendered as:
{
"template": {
"query": {
"match": {
"title": "{{query_string}}"
}
}
}
}
This template can be deleted by
DELETE /_search/template/<templatename>
To use an indexed template at search time use:
GET /_search/template
{
"id": "templateName",
"params": {
"query_string": "search for these words"
}
}
The name of the query template stored in the .scripts index.
Validating templates
A template can be rendered with the given parameters using:
GET /_render/template
{
"inline": {
"query": {
"terms": {
"status": [
"{{#status}}",
"{{.}}",
"{{/status}}"
]
}
}
},
"params": {
"status": [ "pending", "published" ]
}
}
This call will return the rendered template:
{
"template_output": {
"query": {
"terms": {
"status": [
"pending",
"published"
]
}
}
}
}
The status array has been populated with values from the params object.
File and indexed templates can also be rendered by replacing inline with
file or id respectively. For example, to render a file template:
GET /_render/template
{
"file": "my_template",
"params": {
"status": [ "pending", "published" ]
}
}
Pre-registered templates can also be rendered using
GET /_render/template/<template_name>
{
"params": {
"..."
}
}
38. Search Shards API
The search shards API returns the indices and shards that a search request would be executed against. This can give useful feedback for working out issues or planning optimizations with routing and shard preferences.
The index and type parameters may be single values, or comma-separated.
Usage
Full example:
curl -XGET 'localhost:9200/twitter/_search_shards'
This will yield the following result:
{
"nodes": {
"JklnKbD7Tyqi9TP3_Q_tBg": {
"name": "Rl'nnd",
"transport_address": "inet[/192.168.1.113:9300]"
}
},
"shards": [
[
{
"index": "twitter",
"node": "JklnKbD7Tyqi9TP3_Q_tBg",
"primary": true,
"relocating_node": null,
"shard": 3,
"state": "STARTED"
}
],
[
{
"index": "twitter",
"node": "JklnKbD7Tyqi9TP3_Q_tBg",
"primary": true,
"relocating_node": null,
"shard": 4,
"state": "STARTED"
}
],
[
{
"index": "twitter",
"node": "JklnKbD7Tyqi9TP3_Q_tBg",
"primary": true,
"relocating_node": null,
"shard": 0,
"state": "STARTED"
}
],
[
{
"index": "twitter",
"node": "JklnKbD7Tyqi9TP3_Q_tBg",
"primary": true,
"relocating_node": null,
"shard": 2,
"state": "STARTED"
}
],
[
{
"index": "twitter",
"node": "JklnKbD7Tyqi9TP3_Q_tBg",
"primary": true,
"relocating_node": null,
"shard": 1,
"state": "STARTED"
}
]
]
}
And specifying the same request, this time with a routing value:
curl -XGET 'localhost:9200/twitter/_search_shards?routing=foo,baz'
This will yield the following result:
{
"nodes": {
"JklnKbD7Tyqi9TP3_Q_tBg": {
"name": "Rl'nnd",
"transport_address": "inet[/192.168.1.113:9300]"
}
},
"shards": [
[
{
"index": "twitter",
"node": "JklnKbD7Tyqi9TP3_Q_tBg",
"primary": true,
"relocating_node": null,
"shard": 2,
"state": "STARTED"
}
],
[
{
"index": "twitter",
"node": "JklnKbD7Tyqi9TP3_Q_tBg",
"primary": true,
"relocating_node": null,
"shard": 4,
"state": "STARTED"
}
]
]
}
This time the search will only be executed against two of the shards, because routing values have been specified.
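Conceptually, each routing value is hashed and mapped onto one primary shard, which is why two routing values select at most two shards. A simplified Python sketch (illustrative only: Elasticsearch uses its own hash function, not CRC32):

```python
import zlib

def shard_for_routing(routing, num_primary_shards):
    # Simplified: Elasticsearch computes hash(routing) % number_of_primary_shards
    # (with its own hash function); CRC32 here just makes the idea concrete.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

# With routing=foo,baz each value maps to one shard of the 5-shard twitter
# index, so the search touches at most two shards.
shards = {shard_for_routing(r, 5) for r in ("foo", "baz")}
print(sorted(shards))
```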
All parameters:
routing
    A comma-separated list of routing values to take into account when determining which shards a request would be executed against.
preference
    Controls a preference of which shard replicas to execute the search request on. By default, operations are randomized between the shard replicas.
local
    A boolean value indicating whether to read the cluster state locally in order to determine where shards are allocated, instead of using the master node's cluster state.
39. Suggesters
The suggest feature suggests similar looking terms based on a provided text by using a suggester. Parts of the suggest feature are still under development.
The suggest request part is either defined alongside the query part in a
_search request or via the REST _suggest endpoint.
curl -s -XPOST 'localhost:9200/_search' -d '{
"query" : {
...
},
"suggest" : {
...
}
}'
Suggest requests executed against the _suggest endpoint should omit
the surrounding suggest element which is only used if the suggest
request is part of a search.
curl -XPOST 'localhost:9200/_suggest' -d '{
"my-suggestion" : {
"text" : "the amsterdma meetpu",
"term" : {
"field" : "body"
}
}
}'
Several suggestions can be specified per request. Each suggestion is
identified with an arbitrary name. In the example below two suggestions
are requested. Both my-suggest-1 and my-suggest-2 suggestions use
the term suggester, but have a different text.
"suggest" : {
"my-suggest-1" : {
"text" : "the amsterdma meetpu",
"term" : {
"field" : "body"
}
},
"my-suggest-2" : {
"text" : "the rottredam meetpu",
"term" : {
"field" : "title"
}
}
}
The below suggest response example includes the suggestion response for
my-suggest-1 and my-suggest-2. Each suggestion part contains
entries. Each entry is effectively a token from the suggest text and
contains the suggestion entry text, the original start offset and length
in the suggest text, and, if found, an arbitrary number of options.
{
...
"suggest": {
"my-suggest-1": [
{
"text" : "amsterdma",
"offset": 4,
"length": 9,
"options": [
...
]
},
...
],
"my-suggest-2" : [
...
]
}
...
}
Each options array contains an option object that includes the suggested text, its document frequency and score compared to the suggest entry text. The meaning of the score depends on the used suggester. The term suggester’s score is based on the edit distance.
"options": [
{
"text": "amsterdam",
"freq": 77,
"score": 0.8888889
},
...
]
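The example scores are consistent with an edit distance that counts an adjacent transposition as a single edit, normalized by the length of the suggest token. A sketch, assuming the normalization 1 - distance/len(token) (an assumption for illustration; the actual Lucene scoring may differ):

```python
def osa_distance(a, b):
    # Optimal string alignment: Levenshtein plus adjacent transpositions.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def suggestion_score(token, candidate):
    # Assumed normalization: one transposition in "amsterdma" yields 1 - 1/9.
    return 1.0 - osa_distance(token, candidate) / len(token)

print(round(suggestion_score("amsterdma", "amsterdam"), 7))  # 0.8888889
```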
Global suggest text
To avoid repetition of the suggest text, it is possible to define a
global text. In the example below the suggest text is defined globally
and applies to the my-suggest-1 and my-suggest-2 suggestions.
"suggest" : {
"text" : "the amsterdma meetpu",
"my-suggest-1" : {
"term" : {
"field" : "title"
}
},
"my-suggest-2" : {
"term" : {
"field" : "body"
}
}
}
The suggest text can, as in the example above, also be specified as a suggestion-specific option. The suggest text specified at the suggestion level overrides the suggest text at the global level.
Other suggest example
In the below example we request suggestions for the following suggest
text: devloping distibutd saerch engies on the title field with a
maximum of 3 suggestions per term inside the suggest text. Note that in
this example we set size to 0. This isn’t required, but a
nice optimization. The suggestions are gathered in the query phase and
in the case that we only care about suggestions (so no hits) we don’t
need to execute the fetch phase.
curl -s -XPOST 'localhost:9200/_search' -d '{
"size": 0,
"suggest" : {
"my-title-suggestions-1" : {
"text" : "devloping distibutd saerch engies",
"term" : {
"size" : 3,
"field" : "title"
}
}
}
}'
The above request could yield the response as stated in the code example
below. As you can see if we take the first suggested options of each
suggestion entry we get developing distributed search engines as
result.
{
...
"suggest": {
"my-title-suggestions-1": [
{
"text": "devloping",
"offset": 0,
"length": 9,
"options": [
{
"text": "developing",
"freq": 77,
"score": 0.8888889
},
{
"text": "deloping",
"freq": 1,
"score": 0.875
},
{
"text": "deploying",
"freq": 2,
"score": 0.7777778
}
]
},
{
"text": "distibutd",
"offset": 10,
"length": 9,
"options": [
{
"text": "distributed",
"freq": 217,
"score": 0.7777778
},
{
"text": "disributed",
"freq": 1,
"score": 0.7777778
},
{
"text": "distribute",
"freq": 1,
"score": 0.7777778
}
]
},
{
"text": "saerch",
"offset": 20,
"length": 6,
"options": [
{
"text": "search",
"freq": 1038,
"score": 0.8333333
},
{
"text": "smerch",
"freq": 3,
"score": 0.8333333
},
{
"text": "serch",
"freq": 2,
"score": 0.8
}
]
},
{
"text": "engies",
"offset": 27,
"length": 6,
"options": [
{
"text": "engines",
"freq": 568,
"score": 0.8333333
},
{
"text": "engles",
"freq": 3,
"score": 0.8333333
},
{
"text": "eggies",
"freq": 1,
"score": 0.8333333
}
]
}
]
}
...
}
39.1. Term suggester
Note: In order to understand the format of suggestions, please read the Suggesters page first.
The term suggester suggests terms based on edit distance. The provided
suggest text is analyzed before terms are suggested. The suggested terms
are provided per analyzed suggest text token. The term suggester
doesn't take into account the query that is part of the request.
39.1.1. Common suggest options:
text
    The suggest text. The suggest text is a required option that needs to be set globally or per suggestion.
field
    The field to fetch the candidate suggestions from. This is a required option that either needs to be set globally or per suggestion.
analyzer
    The analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field.
size
    The maximum corrections to be returned per suggest text token.
sort
    Defines how suggestions should be sorted per suggest text term. Two possible values:
    - score: sort by score first, then document frequency and then the term itself.
    - frequency: sort by document frequency first, then similarity score and then the term itself.
suggest_mode
    The suggest mode controls what suggestions are included, or for what suggest text terms suggestions should be suggested. Three possible values can be specified:
    - missing: only provide suggestions for suggest text terms that are not in the index. This is the default.
    - popular: only suggest suggestions that occur in more docs than the original suggest text term.
    - always: suggest any matching suggestions based on terms in the suggest text.
39.1.2. Other term suggest options:
lowercase_terms
    Lowercases the suggest text terms after text analysis.
max_edits
    The maximum edit distance candidate suggestions can have in order to be considered as a suggestion. Can only be a value between 1 and 2. Any other value results in a bad request error being thrown. Defaults to 2.
prefix_length
    The minimum number of prefix characters that must match in order to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance; usually misspellings don't occur at the beginning of terms. (Old name "prefix_len" is deprecated.)
min_word_length
    The minimum length a suggest text term must have in order to be included. Defaults to 4. (Old name "min_word_len" is deprecated.)
shard_size
    Sets the maximum number of suggestions to be retrieved from each individual shard. During the reduce phase only the top N suggestions are returned, based on the size option. Defaults to the size option. Setting this to a value higher than size can be useful to get a more accurate document frequency for spelling corrections, at the cost of performance.
max_inspections
    A factor that is multiplied with the shard_size in order to inspect more candidate spelling corrections on the shard level. Can improve accuracy at the cost of performance. Defaults to 5.
min_doc_freq
    The minimal threshold in number of documents a suggestion should appear in. This can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by only suggesting high frequency terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified, then the number cannot be fractional. The shard level document frequencies are used for this option.
max_term_freq
    The maximum threshold in number of documents in which a suggest text token can exist in order to be included. Can be a relative percentage number (e.g. 0.4) or an absolute number representing document frequencies. If a value higher than 1 is specified, it cannot be fractional. Defaults to 0.01f. This can be used to exclude high frequency terms from being spellchecked; high frequency terms are usually spelled correctly, and excluding them also improves the spellcheck performance. The shard level document frequencies are used for this option.
string_distance
    Which string distance implementation to use for comparing how similar suggested terms are. Five possible values can be specified: internal (the default, based on damerau_levenshtein but highly optimized for comparing terms inside the index), damerau_levenshtein, levenstein, jarowinkler and ngram.
39.2. Phrase Suggester
Note: In order to understand the format of suggestions, please read the Suggesters page first.

The term suggester provides a very convenient API to access word
alternatives on a per token basis within a certain string distance. The API
allows accessing each token in the stream individually while
suggest-selection is left to the API consumer. Yet, often pre-selected
suggestions are required in order to present them to the end-user. The
phrase suggester adds additional logic on top of the term suggester
to select entire corrected phrases instead of individual tokens weighted
based on ngram-language models. In practice this suggester will be
able to make better decisions about which tokens to pick based on
co-occurrence and frequencies.
39.2.1. API Example
The phrase request is defined along side the query part in the json
request:
curl -XPOST 'localhost:9200/_search' -d '{
"suggest" : {
"text" : "Xor the Got-Jewel",
"simple_phrase" : {
"phrase" : {
"analyzer" : "body",
"field" : "bigram",
"size" : 1,
"real_word_error_likelihood" : 0.95,
"max_errors" : 0.5,
"gram_size" : 2,
"direct_generator" : [ {
"field" : "body",
"suggest_mode" : "always",
"min_word_length" : 1
} ],
"highlight": {
"pre_tag": "<em>",
"post_tag": "</em>"
}
}
}
}
}'
The response contains suggestions scored by the most likely spell
correction first. In this case we received the expected correction
xorr the god jewel first while the second correction is less
conservative where only one of the errors is corrected. Note, the
request is executed with max_errors set to 0.5 so 50% of the terms
can contain misspellings (See parameter descriptions below).
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2938,
"max_score" : 0.0,
"hits" : [ ]
},
"suggest" : {
"simple_phrase" : [ {
"text" : "Xor the Got-Jewel",
"offset" : 0,
"length" : 17,
"options" : [ {
"text" : "xorr the god jewel",
"highlighted": "<em>xorr</em> the <em>god</em> jewel",
"score" : 0.17877324
}, {
"text" : "xor the god jewel",
"highlighted": "xor the <em>god</em> jewel",
"score" : 0.14231323
} ]
} ]
}
}
39.2.2. Basic Phrase suggest API parameters
field
    The name of the field used to do n-gram lookups for the language model; the suggester will use this field to gain statistics to score corrections. This field is mandatory.
gram_size
    Sets the maximum size of the n-grams (shingles) in the field. If the field doesn't contain n-grams (shingles), this should be omitted or set to 1. Note that Elasticsearch tries to detect the gram size based on the specified field; if the field uses a shingle filter, the gram_size is set to the max_shingle_size if not explicitly set.
real_word_error_likelihood
    The likelihood of a term being misspelled even if the term exists in the dictionary. The default is 0.95, corresponding to 5% of the real words being misspelled.
confidence
    The confidence level defines a factor applied to the input phrase's score, which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold will be included in the result. For instance, a confidence level of 1.0 will only return suggestions that score higher than the input phrase. If set to 0.0 the top N candidates are returned. The default is 1.0.
max_errors
    The maximum percentage of the terms that can be considered misspellings in order to form a correction. This method accepts a float value in the range [0..1) as a fraction of the actual query terms, or a number >= 1 as an absolute number of query terms. The default is set to 1.0, which corresponds to returning only corrections with at most one misspelled term.
separator
    The separator that is used to separate terms in the bigram field. If not set, the whitespace character is used as a separator.
size
    The number of candidates that are generated for each individual query term. Low numbers like 3 or 5 typically produce good results. Raising this can bring up terms with higher edit distances. The default is 5.
analyzer
    Sets the analyzer to analyse the suggest text with. Defaults to the search analyzer of the suggest field passed via field.
shard_size
    Sets the maximum number of suggested terms to be retrieved from each individual shard. During the reduce phase, only the top N suggestions are returned, based on the size option. Defaults to 5.
text
    Sets the text / query to provide suggestions for.
highlight
    Sets up suggestion highlighting. If not provided then no highlighted field is returned. If provided, it must contain exactly pre_tag and post_tag, which are wrapped around the changed tokens.
collate
    Checks each suggestion against the specified query to prune suggestions for which no matching docs exist in the index. The collate query for a suggestion is run only on the local shard from which the suggestion has been generated. Additionally, a prune option can be specified; when set to true, each suggestion carries an extra collate_match option indicating whether the generated phrase matched any document (the default for prune is false). An example follows:
curl -XPOST 'localhost:9200/_search' -d '{
"suggest" : {
"text" : "Xor the Got-Jewel",
"simple_phrase" : {
"phrase" : {
"field" : "bigram",
"size" : 1,
"direct_generator" : [ {
"field" : "body",
"suggest_mode" : "always",
"min_word_length" : 1
} ],
"collate": {
"query": {
"inline" : {
"match": {
"{{field_name}}" : "{{suggestion}}"
}
}
},
"params": {"field_name" : "title"},
"prune": true
}
}
}
}
}'
Notes on this example:

- This query will be run once for every suggestion.
- The {{suggestion}} variable will be replaced by the text of each suggestion.
- An additional field_name variable has been specified in params and is used by the match query.
- All suggestions will be returned with an extra collate_match option indicating whether the generated phrase matched any document.
39.2.3. Smoothing Models
The phrase suggester supports multiple smoothing models to balance
weight between infrequent grams (grams (shingles) that do not exist in
the index) and frequent grams (that appear at least once in the index).
stupid_backoff
    A simple backoff model that backs off to lower order n-gram models if the higher order count is 0, discounting the lower order model by a constant factor. The default discount is 0.4. Stupid Backoff is the default model.
laplace
    A smoothing model that uses additive smoothing, where a constant (typically 1.0 or smaller) is added to all counts to balance weights. The default alpha is 0.5.
linear_interpolation
    A smoothing model that takes the weighted mean of the unigrams, bigrams and trigrams based on user supplied weights (lambdas). Linear Interpolation doesn't have any default values. All parameters (trigram_lambda, bigram_lambda, unigram_lambda) must be supplied.
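To make the difference between the models concrete, here is a simplified Python sketch of the scoring idea (an illustration only; the actual Lucene implementations differ in detail):

```python
def stupid_backoff(bigram_count, prefix_count, unigram_count, total_terms,
                   discount=0.4):
    # Back off to the lower-order (unigram) model, discounted by a constant
    # factor, whenever the higher-order (bigram) count is 0.
    if bigram_count > 0 and prefix_count > 0:
        return bigram_count / prefix_count
    return discount * unigram_count / total_terms

def linear_interpolation(p_uni, p_bi, p_tri,
                         unigram_lambda, bigram_lambda, trigram_lambda):
    # Weighted mean of the unigram, bigram and trigram probabilities;
    # the lambdas must be supplied by the user.
    return unigram_lambda * p_uni + bigram_lambda * p_bi + trigram_lambda * p_tri
```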
39.2.4. Candidate Generators
The phrase suggester uses candidate generators to produce a list of
possible terms per term in the given text. A single candidate generator
is similar to a term suggester called for each individual term in the
text. The output of the generators is subsequently scored in combination
with the candidates from the other terms to form suggestion candidates.
Currently only one type of candidate generator is supported, the
direct_generator. The phrase suggest API accepts a list of generators
under the key direct_generator; each of the generators in the list is
called per term in the original text.
39.2.5. Direct Generators
The direct generators support the following parameters:
field
    The field to fetch the candidate suggestions from. This is a required option that either needs to be set globally or per suggestion.
size
    The maximum corrections to be returned per suggest text token.
suggest_mode
    The suggest mode controls what suggestions are included on the suggestions generated on each shard. All values other than always can be thought of as an optimization to generate fewer suggestions to test on each shard; they are not rechecked when combining the suggestions generated on each shard. Three possible values can be specified:
    - missing: only generate suggestions for terms that are not in the shard. This is the default.
    - popular: only suggest terms that occur in more docs on the shard than the original term.
    - always: suggest any matching suggestions based on terms in the suggest text.
max_edits
    The maximum edit distance candidate suggestions can have in order to be considered as a suggestion. Can only be a value between 1 and 2. Any other value results in a bad request error being thrown. Defaults to 2.
prefix_length
    The minimum number of prefix characters that must match in order to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance; usually misspellings don't occur at the beginning of terms. (Old name "prefix_len" is deprecated.)
min_word_length
    The minimum length a suggest text term must have in order to be included. Defaults to 4. (Old name "min_word_len" is deprecated.)
max_inspections
    A factor that is multiplied with the shard_size in order to inspect more candidate spelling corrections on the shard level. Can improve accuracy at the cost of performance. Defaults to 5.
min_doc_freq
    The minimal threshold in number of documents a suggestion should appear in. This can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by only suggesting high frequency terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified, then the number cannot be fractional. The shard level document frequencies are used for this option.
max_term_freq
    The maximum threshold in number of documents in which a suggest text token can exist in order to be included. Can be a relative percentage number (e.g. 0.4) or an absolute number representing document frequencies. If a value higher than 1 is specified, it cannot be fractional. Defaults to 0.01f. This can be used to exclude high frequency terms from being spellchecked; high frequency terms are usually spelled correctly, and excluding them also improves the spellcheck performance. The shard level document frequencies are used for this option.
pre_filter
    A filter (analyzer) that is applied to each of the tokens passed to this candidate generator. This filter is applied to the original token before candidates are generated.
post_filter
    A filter (analyzer) that is applied to each of the generated tokens before they are passed to the actual phrase scorer.
The following example shows a phrase suggest call with two generators,
the first one is using a field containing ordinary indexed terms and the
second one uses a field containing terms indexed with a reverse filter
(tokens are indexed in reverse order). This is used to overcome the limitation
of the direct generators to require a constant prefix to provide
high-performance suggestions. The pre_filter and post_filter options
accept ordinary analyzer names.
curl -s -XPOST 'localhost:9200/_search' -d '{
"suggest" : {
"text" : "Xor the Got-Jewel",
"simple_phrase" : {
"phrase" : {
"analyzer" : "body",
"field" : "bigram",
"size" : 4,
"real_word_error_likelihood" : 0.95,
"confidence" : 2.0,
"gram_size" : 2,
"direct_generator" : [ {
"field" : "body",
"suggest_mode" : "always",
"min_word_length" : 1
}, {
"field" : "reverse",
"suggest_mode" : "always",
"min_word_length" : 1,
"pre_filter" : "reverse",
"post_filter" : "reverse"
} ]
}
}
}
}'
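The reverse field helps because a direct generator only considers candidates that share a leading prefix with the input term; running the same check on reversed tokens catches misspellings in the first characters. A toy illustration (hypothetical terms, not the Elasticsearch implementation):

```python
def shares_prefix(a, b, n=1):
    # A direct generator only inspects candidates whose first prefix_length
    # characters match the input term.
    return a[:n] == b[:n]

# Hypothetical misspelling in the very first letter:
term, candidate = "omsterdam", "amsterdam"

forward = shares_prefix(term, candidate)              # no common prefix: skipped
reverse = shares_prefix(term[::-1], candidate[::-1])  # reversed forms share one
print(forward, reverse)
```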
pre_filter and post_filter can also be used to inject synonyms after
candidates are generated. For instance, for the query captain usq we
might generate the candidate usa for the term usq, which is a synonym for
america; this allows us to present captain america to the user if this
phrase scores high enough.
39.3. Completion Suggester
Note: In order to understand the format of suggestions, please read the Suggesters page first.
The completion suggester is a so-called prefix suggester. It does not
do spell correction like the term or phrase suggesters but allows
basic auto-complete functionality.
39.3.1. Why another suggester? Why not prefix queries?
The first question which comes to mind when reading about a prefix suggestion is why you should use it at all if you already have prefix queries. The answer is simple: prefix suggestions are fast.
The data structures are internally backed by Lucene's
AnalyzingSuggester, which uses FSTs (finite state transducers) to
execute suggestions. Usually these data structures are costly to
create, stored in-memory and need to be rebuilt every now and then to
reflect changes in your indexed documents. The completion suggester
circumvents this by storing the FST (finite state transducer) as part
of your index during index time. This allows for really fast
loads and executions.
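To see why an index-time data structure makes prefix lookup cheap, here is a toy stand-in (a sorted array with binary search; the real suggester uses an FST, which is far more compact and carries weights in the structure itself):

```python
import bisect

class ToyPrefixSuggester:
    def __init__(self, entries):
        # entries: (input, weight) pairs, conceptually built at index time.
        self.weights = dict(entries)
        self.inputs = sorted(self.weights)

    def suggest(self, prefix, size=5):
        # Binary search to the first input >= prefix, then scan the
        # contiguous run of inputs sharing that prefix.
        i = bisect.bisect_left(self.inputs, prefix)
        matches = []
        while i < len(self.inputs) and self.inputs[i].startswith(prefix):
            matches.append(self.inputs[i])
            i += 1
        matches.sort(key=lambda s: -self.weights[s])  # highest weight first
        return matches[:size]

s = ToyPrefixSuggester([("nevermind", 34), ("nirvana", 34), ("nickelback", 1)])
print(s.suggest("n"))
```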
39.3.2. Mapping
In order to use this feature, you have to specify a special mapping for this field, which enables the special storage of the field.
curl -X PUT localhost:9200/music
curl -X PUT localhost:9200/music/song/_mapping -d '{
"song" : {
"properties" : {
"name" : { "type" : "string" },
"suggest" : { "type" : "completion",
"analyzer" : "simple",
"search_analyzer" : "simple",
"payloads" : true
}
}
}
}'
Mapping supports the following parameters:
analyzer
    The index analyzer to use, defaults to simple. In case you are wondering why we did not opt for the standard analyzer: we try to have easy to understand behaviour here, and if you index the field content "At the Drive-in", you will not get any suggestions for "a", nor for "d" (the first non-stopword).
search_analyzer
    The search analyzer to use, defaults to the value of analyzer.
payloads
    Enables the storing of payloads, defaults to false.
preserve_separators
    Preserves the separators, defaults to true. If disabled, you could find a field starting with "Foo Fighters", if you suggest for "foof".
preserve_position_increments
    Enables position increments, defaults to true. If disabled and using a stopwords analyzer, you could get a field starting with "The Beatles", if you suggest for "b". Note: you could also achieve this by indexing two inputs, "Beatles" and "The Beatles"; no need to change a simple analyzer, if you are able to enrich your data.
max_input_length
    Limits the length of a single input, defaults to 50 UTF-16 code points. This limit is only used at index time to reduce the total number of characters per input string, in order to prevent massive inputs from bloating the underlying datastructure. Most use cases won't be influenced by the default value, since prefix completions seldom grow beyond prefixes longer than a handful of characters. (Old name "max_input_len" is deprecated.)
39.3.3. Indexing
curl -X PUT 'localhost:9200/music/song/1?refresh=true' -d '{
"name" : "Nevermind",
"suggest" : {
"input": [ "Nevermind", "Nirvana" ],
"output": "Nirvana - Nevermind",
"payload" : { "artistId" : 2321 },
"weight" : 34
}
}'
The following parameters are supported:
input
    The input to store. This can be an array of strings or just a string. This field is mandatory.
output
    The string to return if a suggestion matches. This is very useful to normalize outputs (i.e. have them always in the format "artist - songname"). This is optional. Note: the result is de-duplicated if several documents have the same output, i.e. only one is returned as part of the suggest result.
payload
    An arbitrary JSON object, which is simply returned in the suggest option. You could store data like the id of a document, in order to load it from Elasticsearch without executing another search (which might not yield any results, if input and output differ strongly).
weight
    A positive integer or a string containing a positive integer, which defines a weight and allows you to rank your suggestions. This field is optional.
Note: Even though you will lose most of the features of the completion suggester, you can choose to use the following shorthand form. Keep in mind that you will not be able to use several inputs, an output, payloads or weights. This form does still work inside of multi fields.
{
"suggest" : "Nirvana"
}
Note: The suggest data structure might not reflect deletes on documents
immediately. You may need to do an optimize for that. You can call
optimize with only_expunge_deletes=true to only target deletions for
merging. By default only_expunge_deletes=true will only select segments
for merging where the percentage of deleted documents is greater than 10% of
the number of documents in that segment. To adjust this,
index.merge.policy.expunge_deletes_allowed can be set to a value between
[0..100]. Please remember that even with this option set, optimize is
considered an extremely heavy operation and should be called rarely.
39.3.4. Querying
Suggesting works as usual, except that you have to specify the suggest
type as completion.
curl -X POST 'localhost:9200/music/_suggest?pretty' -d '{
"song-suggest" : {
"text" : "n",
"completion" : {
"field" : "suggest"
}
}
}'
{
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"song-suggest" : [ {
"text" : "n",
"offset" : 0,
"length" : 1,
"options" : [ {
"text" : "Nirvana - Nevermind",
"score" : 34.0, "payload" : {"artistId":2321}
} ]
} ]
}
As you can see, the payload is included in the response, if configured
appropriately. If you configured a weight for a suggestion, this weight
is used as score. Also the text field uses the output of your
indexed suggestion, if configured, otherwise the matched part of the
input field.
The basic completion suggester query supports the following two parameters:
field
    The name of the field on which to run the query (required).
size
    The number of suggestions to return (defaults to 5).

Note: The completion suggester considers all documents in the index. See Context Suggester for an explanation of how to query a subset of documents instead.
39.3.5. Fuzzy queries
The completion suggester also supports fuzzy queries - this means, you can actually have a typo in your search and still get results back.
curl -X POST 'localhost:9200/music/_suggest?pretty' -d '{
"song-suggest" : {
"text" : "n",
"completion" : {
"field" : "suggest",
"fuzzy" : {
"fuzziness" : 2
}
}
}
}'
The fuzzy query can take specific fuzzy parameters. The following parameters are supported:
fuzziness
    The fuzziness factor. Defaults to AUTO.
transpositions
    If set to true, transpositions are counted as one change instead of two. Defaults to true.
min_length
    Minimum length of the input before fuzzy suggestions are returned. Defaults to 3.
prefix_length
    Minimum length of the input, which is not checked for fuzzy alternatives. Defaults to 1.
unicode_aware
    If set to true, all measurements (like edit distance, transpositions and lengths) are measured in Unicode code points instead of in bytes. This is slightly slower than raw bytes, so it is set to false by default.

Note: If you want to stick with the default values, but still use fuzzy, you can either use fuzzy: {} or fuzzy: true.
39.4. Context Suggester
The context suggester is an extension to the suggest API of Elasticsearch. The
suggester system provides a very fast way of searching documents by handling them
entirely in memory. But this special treatment does not allow the handling of
traditional queries and filters, because those would have a notable impact on
performance. So the context extension is designed to take so-called context information
into account to provide a more accurate way of searching within the suggester system.
Instead of using the traditional query and filter system, a predefined context is
configured to limit suggestions to a particular subset of suggestions.
Such a context is defined by a set of context mappings which can either be a simple
category or a geo location. The information used by the context suggester is
configured in the type mapping with the context parameter, which lists all of the
contexts that need to be specified in each document and in each suggestion request.
For instance:
PUT services/_mapping/service
{
"service": {
"properties": {
"name": {
"type" : "string"
},
"tag": {
"type" : "string"
},
"suggest_field": {
"type": "completion",
"context": {
"color": {
"type": "category",
"path": "color_field",
"default": ["red", "green", "blue"]
},
"location": {
"type": "geo",
"precision": "5m",
"neighbors": true,
"default": "u33"
}
}
}
}
}
}
- color: see Category Context below.
- location: see Geo location Context below.
However the contexts are specified (as type category or geo, both discussed below), each
context value generates a new subset of documents which can be queried by the completion
suggester. Both types accept a default parameter which provides a default value to use
if the corresponding context value is absent.
The basic structure of this element is that each field forms a new context, and the field name
is used to reference this context information later on, during indexing or querying. All context
mappings have the default and the type option in common. The value of the default field
is used whenever no specific value is provided for the given context. Note that a context is
defined by at least one value. The type option defines the kind of information held by this
context. These types are explained further in the following sections.
Category Context
The category context allows you to specify one or more categories in the document at index time.
The document will be assigned to each named category, which can then be queried later. The category
type also allows you to specify a field to extract the categories from. The path parameter is used to
specify this field of the documents that should be used. If the referenced field contains multiple
values, all these values will be used as alternative categories.
Category Mapping
The mapping for a category is simply defined by its default values. These can either be
defined as list of default categories:
"context": {
"color": {
"type": "category",
"default": ["red", "orange"]
}
}
or as a single value
"context": {
"color": {
"type": "category",
"default": "red"
"contexts": {
"place_type": ["cafe", "food"]
}
}
}
or as reference to another field within the documents indexed:
"context": {
"color": {
"type": "category",
"default": "red",
"path": "color_field"
}
}
In this case the default categories will only be used if the given field does not
exist within the document. In the example above the categories are read from a
field named color_field. If this field does not exist, the category red is assumed for
the context color.
Indexing category contexts
Within a document the category is specified either as an array of values, a
single value or null. A list of values is interpreted as alternative categories, so
a document belongs to all the categories defined. If the category is null or remains
unset, the categories will be retrieved from the document's field addressed by the path
parameter. If this value is not set or the field is missing, the default values of the
mapping will be assigned to the context.
PUT services/service/1
{
"name": "knapsack",
"suggest_field": {
"input": ["knacksack", "backpack", "daypack"],
"context": {
"color": ["red", "yellow"]
}
}
}
Category Query
A query within a category works similar to the configuration. If the value is null
the mappings default categories will be used. Otherwise the suggestion takes place
for all documents that have at least one category in common with the query.
POST services/_suggest?pretty
{
"suggest" : {
"text" : "m",
"completion" : {
"field" : "suggest_field",
"size": 10,
"context": {
"color": "red"
}
}
}
}
Geo location Context
A geo context allows you to limit results to those that lie within a certain distance
of a specified geolocation. At index time, a lat/long geo point is converted into a
geohash of a certain precision, which provides the context.
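To make the lat/lon-to-geohash conversion concrete, here is a minimal Python sketch of the standard geohash encoding algorithm (not Elasticsearch's internal implementation). "Precision" here is the number of base-32 characters kept; a shorter geohash is always a prefix of a longer one for the same point, which is what lets a single location serve several precisions:

```python
# Standard geohash base-32 alphabet (no a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=12):
    """Encode a lat/lon pair as a geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    result = []
    use_lon = True  # bits alternate, longitude first
    ch = 0
    bit = 0
    while len(result) < precision:
        rng = lon_range if use_lon else lat_range
        val = lon if use_lon else lat
        mid = (rng[0] + rng[1]) / 2.0
        ch <<= 1
        if val >= mid:
            ch |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        use_lon = not use_lon
        bit += 1
        if bit == 5:  # 5 bits per base-32 character
            result.append(BASE32[ch])
            bit = 0
            ch = 0
    return "".join(result)
```

Each additional character narrows the cell, so trimming a geohash is equivalent to lowering the precision.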
Geo location Mapping
The mapping for a geo context accepts four settings, of which only precision is required:
precision
This defines the precision of the geohash and can be specified as a distance value (5m, 10km, etc.) or as a raw geohash precision (1..12).
neighbors
Geohashes are rectangles, so a geolocation which in reality is only 1 metre away from the specified point may fall into the neighbouring rectangle. Set neighbors to true to include the neighbouring geohash cells. Defaults to true.
path
Optionally specify a field to use to look up the geopoint.
default
The geopoint to use if no geopoint has been specified.
Since all locations of this mapping are translated into geohashes, each location matches
a geohash cell. So some results that lie within the specified range, but not in the same
cell as the query location, will not match. To avoid this, the neighbors option allows
matching of cells that border the document's location. This option
is turned on by default.
If a document or a query doesn't define a location, a value to use instead can be defined by
the default option. The value of this option supports all the ways a geo_point can be
defined. The path refers to another field within the document from which to retrieve the
location. If this field contains multiple values, the document will be linked to all of these
locations.
"context": {
"location": {
"type": "geo",
"precision": ["1km", "5m"],
"neighbors": true,
"path": "pin",
"default": {
"lat": 0.0,
"lon": 0.0
}
}
}
Geo location Config
Within a document a geo location retrieved from the mapping definition can be overridden
by another location. In this case the context mapped to a geo location supports all
variants of defining a geo_point.
PUT services/service/1
{
"name": "some hotel 1",
"suggest_field": {
"input": ["my hotel", "this hotel"],
"context": {
"location": {
"lat": 0,
"lon": 0
"contexts": [
"location": [
{
"lat": 43.6624803,
"lon": -79.3863353
},
{
"lat": 43.6624718,
"lon": -79.3873227
}
}
}
}
Geo location Query
As in the configuration, the geo location query supports all representations of a
geo_point to define the location. In this simple case, all precision values defined
in the mapping will be applied to the given location.
POST services/_suggest
{
"suggest" : {
"text" : "m",
"completion" : {
"field" : "suggest_field",
"size": 10,
"contexts": {
"location": {
"lat": 0,
"lon": 0
}
}
}
}
}
But it is also possible to use only a subset of the precisions set in the mapping, via the
precision parameter. As in the mapping, this parameter may be set to a
single precision value or a list of them.
POST services/_suggest
{
"suggest" : {
"text" : "m",
"completion" : {
"field" : "suggest_field",
"size": 10,
"context": {
"location": {
"value": {
"lat": 0,
"lon": 0
},
"precision": "1km"
}
}
}
}
}
A special form of the query is defined by an extension of the object representation of
the geo_point. Using this representation allows you to set the precision parameter within
the location itself:
POST services/_suggest
{
"suggest" : {
"text" : "m",
"completion" : {
"field" : "suggest_field",
"size": 10,
"context": {
"location": {
"lat": 0,
"lon": 0,
"precision": "1km"
}
}
}
}
}
40. Multi Search API
The multi search API allows several search requests to be executed within
the same API call. The endpoint for it is _msearch.
The format of the request is similar to the bulk API format, and the structure is as follows (the structure is specifically optimized to reduce parsing if a specific search ends up redirected to another node):
header\n
body\n
header\n
body\n
The header part includes which index / indices to search on, optional
(mapping) types to search on, the search_type, preference, and
routing. The body includes the typical search body request (including
the query, aggregations, from, size, and so on). Here is an example:
$ cat requests
{"index" : "test"}
{"query" : {"match_all" : {}}, "from" : 0, "size" : 10}
{"index" : "test", "search_type" : "dfs_query_then_fetch"}
{"query" : {"match_all" : {}}}
{}
{"query" : {"match_all" : {}}}
{"query" : {"match_all" : {}}}
{"search_type" : "dfs_query_then_fetch"}
{"query" : {"match_all" : {}}}
$ curl -XGET localhost:9200/_msearch --data-binary "@requests"; echo
Note that the above includes an example of an empty header (it can also be left without any content), which is supported as well.
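Because each header/body pair is one line of JSON, the request body is easy to build programmatically. A minimal Python sketch (the helper name msearch_body is an assumption, not a client-library function):

```python
import json

def msearch_body(pairs):
    """Serialize (header, body) pairs into the newline-delimited
    format expected by _msearch. An empty header dict ({}) means
    "use the defaults from the request URL"."""
    lines = []
    for header, body in pairs:
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    # the trailing newline terminates the last body line
    return "\n".join(lines) + "\n"

payload = msearch_body([
    ({"index": "test"}, {"query": {"match_all": {}}, "from": 0, "size": 10}),
    ({}, {"query": {"match_all": {}}}),
])
```

The resulting string can be sent as-is, e.g. with curl's --data-binary so the newlines are preserved.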
The response returns a responses array, which includes the search
response for each search request matching its order in the original
multi search request. If there was a complete failure for that specific
search request, an object with error message will be returned in place
of the actual search response.
The endpoint also allows searching against an index/indices and type/types in the URI itself, in which case they will be used as the default unless explicitly overridden in the header. For example:
$ cat requests
{}
{"query" : {"match_all" : {}}, "from" : 0, "size" : 10}
{}
{"query" : {"match_all" : {}}}
{"index" : "test2"}
{"query" : {"match_all" : {}}}
$ curl -XGET localhost:9200/test/_msearch --data-binary @requests; echo
The above will execute the search against the test index for all the
requests that don’t define an index, and the last one will be executed
against the test2 index.
The search_type can be set in a similar manner to globally apply to
all search requests.
41. Count API
The count API allows to easily execute a query and get the number of matches for that query. It can be executed across one or more indices and across one or more types. The query can either be provided using a simple query string as a parameter, or using the Query DSL defined within the request body. Here is an example:
$ curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=user:kimchy'
$ curl -XGET 'http://localhost:9200/twitter/tweet/_count' -d '
{
"query" : {
"term" : { "user" : "kimchy" }
}
}'
Note: the query being sent in the body must be nested in a query key, the same way the search API works.
Both examples above do the same thing, which is count the number of tweets from the twitter index for a certain user. The result is:
{
"count" : 1,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
The query is optional, and when not provided, it will use match_all to
count all the docs.
Multi index, Multi type
The count API can be applied to multiple types in multiple indices.
Request Parameters
When executing count using the query parameter q, the query passed is
a query string using Lucene query parser. There are additional
parameters that can be passed:
| Name | Description |
|---|---|
| df | The default field to use when no field prefix is defined within the query. |
| analyzer | The analyzer name to be used when analyzing the query string. |
| default_operator | The default operator to be used, can be AND or OR. Defaults to OR. |
| lenient | If set to true will cause format based failures (like providing text to a numeric field) to be ignored. Defaults to false. |
| lowercase_expanded_terms | Should terms be automatically lowercased or not. Defaults to true. |
| analyze_wildcard | Should wildcard and prefix queries be analyzed or not. Defaults to false. |
| terminate_after | The maximum count for each shard, upon reaching which the query execution will terminate early. If set, the response will have a boolean field terminated_early to indicate whether the query execution has actually terminated early. |
Request Body
The count can use the Query DSL within
its body in order to express the query that should be executed. The body
content can also be passed as a REST parameter named source.
Both HTTP GET and HTTP POST can be used to execute count with body. Since not all clients support GET with body, POST is allowed as well.
Distributed
The count operation is broadcast across all shards. For each shard id group, a replica is chosen and executed against it. This means that replicas increase the scalability of count.
Routing
The routing value (a comma separated list of the routing values) can be specified to control which shards the count request will be executed on.
42. Search Exists API
deprecated[2.1.0, use regular _search with size set to 0 and terminate_after set to 1 instead]
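Since the API is deprecated, the recommended replacement is worth spelling out. A sketch of the equivalent _search request body and of interpreting its response (the helper name matches_exist is hypothetical; the response fields follow the standard search response format):

```python
# Replacement for the deprecated _search/exists endpoint: a normal
# _search with size=0 (fetch no hits) and terminate_after=1 (stop
# each shard after the first match).
exists_request = {
    "size": 0,
    "terminate_after": 1,
    "query": {"term": {"user": "kimchy"}},
}

def matches_exist(search_response):
    """True if the search response reports at least one matching doc."""
    return search_response["hits"]["total"] > 0
```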
The exists API allows to easily determine if any matching documents exist for a provided query. It can be executed across one or more indices and across one or more types. The query can either be provided using a simple query string as a parameter, or using the Query DSL defined within the request body. Here is an example:
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search/exists?q=user:kimchy'
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search/exists' -d '
{
"query" : {
"term" : { "user" : "kimchy" }
}
}'
Note: the query being sent in the body must be nested in a query key, the same way the search API works.
Both the examples above do the same thing, which is determine the existence of tweets from the twitter index for a certain user. The response body will be of the following format:
{
"exists" : true
}
Multi index, Multi type
The exists API can be applied to multiple types in multiple indices.
Request Parameters
When executing exists using the query parameter q, the query passed is
a query string using Lucene query parser. There are additional
parameters that can be passed:
| Name | Description |
|---|---|
| df | The default field to use when no field prefix is defined within the query. |
| analyzer | The analyzer name to be used when analyzing the query string. |
| default_operator | The default operator to be used, can be AND or OR. Defaults to OR. |
| lenient | If set to true will cause format based failures (like providing text to a numeric field) to be ignored. Defaults to false. |
| lowercase_expanded_terms | Should terms be automatically lowercased or not. Defaults to true. |
| analyze_wildcard | Should wildcard and prefix queries be analyzed or not. Defaults to false. |
Request Body
The exists API can use the Query DSL within
its body in order to express the query that should be executed. The body
content can also be passed as a REST parameter named source.
HTTP GET and HTTP POST can be used to execute exists with body. Since not all clients support GET with body, POST is allowed as well.
Distributed
The exists operation is broadcast across all shards. For each shard id group, a replica is chosen and executed against it. This means that replicas increase the scalability of exists. The exists operation also early terminates shard requests once the first shard reports matched document existence.
Routing
The routing value (a comma separated list of the routing values) can be specified to control which shards the exists request will be executed on.
43. Validate API
The validate API allows a user to validate a potentially expensive query without executing it. The following example shows how it can be used:
curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}'
When the query is valid, the response contains valid:true:
curl -XGET 'http://localhost:9200/twitter/_validate/query?q=user:foo'
{"valid":true,"_shards":{"total":1,"successful":1,"failed":0}}
Request Parameters
When executing a validate request using the query parameter q, the query passed is
a query string using the Lucene query parser. There are additional
parameters that can be passed:
| Name | Description |
|---|---|
| df | The default field to use when no field prefix is defined within the query. |
| analyzer | The analyzer name to be used when analyzing the query string. |
| default_operator | The default operator to be used, can be AND or OR. Defaults to OR. |
| lenient | If set to true will cause format based failures (like providing text to a numeric field) to be ignored. Defaults to false. |
| lowercase_expanded_terms | Should terms be automatically lowercased or not. Defaults to true. |
| analyze_wildcard | Should wildcard and prefix queries be analyzed or not. Defaults to false. |
Or, with a request body:
curl -XGET 'http://localhost:9200/twitter/tweet/_validate/query' -d '{
"query" : {
"bool" : {
"must" : {
"query_string" : {
"query" : "*:*"
}
},
"filter" : {
"term" : { "user" : "kimchy" }
}
}
}
}'
{"valid":true,"_shards":{"total":1,"successful":1,"failed":0}}
Note: the query being sent in the body must be nested in a query key, the same way the search API works.
If the query is invalid, valid will be false. Here the query is
invalid because Elasticsearch knows the post_date field should be a date
due to dynamic mapping, and foo does not correctly parse into a date:
curl -XGET 'http://localhost:9200/twitter/tweet/_validate/query?q=post_date:foo'
{"valid":false,"_shards":{"total":1,"successful":1,"failed":0}}
An explain parameter can be specified to get more detailed information
about why a query failed:
curl -XGET 'http://localhost:9200/twitter/tweet/_validate/query?q=post_date:foo&pretty=true&explain=true'
{
"valid" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"explanations" : [ {
"index" : "twitter",
"valid" : false,
"error" : "[twitter] QueryParsingException[Failed to parse]; nested: IllegalArgumentException[Invalid format: \"foo\"];; java.lang.IllegalArgumentException: Invalid format: \"foo\""
} ]
}
When the query is valid, the explanation defaults to the string
representation of that query. With rewrite set to true, the explanation
is more detailed showing the actual Lucene query that will be executed.
For Fuzzy Queries:
curl -XGET 'http://localhost:9200/imdb/movies/_validate/query?rewrite=true' -d '
{
"query": {
"fuzzy": {
"actors": "kyle"
}
}
}'
Response:
{
"valid": true,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"explanations": [
{
"index": "imdb",
"valid": true,
"explanation": "plot:kyle plot:kylie^0.75 plot:kyne^0.75 plot:lyle^0.75 plot:pyle^0.75 #_type:movies"
}
]
}
For More Like This:
curl -XGET 'http://localhost:9200/imdb/movies/_validate/query?rewrite=true' -d '
{
"query": {
"more_like_this": {
"like": {
"_id": "88247"
},
"boost_terms": 1
}
}
}'
Response:
{
"valid": true,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"explanations": [
{
"index": "imdb",
"valid": true,
"explanation": "((title:terminator^3.71334 plot:future^2.763601 plot:human^2.8415773 plot:sarah^3.4193945 plot:kyle^3.8244398 plot:cyborg^3.9177752 plot:connor^4.040236 plot:reese^4.7133346 ... )~6) -ConstantScore(_uid:movies#88247) #_type:movies"
}
]
}
Note: the request is executed on a single shard only, which is randomly selected. The detailed explanation of the query may depend on which shard is being hit, and therefore may vary from one request to another.
44. Explain API
The explain API computes a score explanation for a query and a specific document. This can give useful feedback on whether or not a document matches a specific query.
The index and type parameters expect a single index and a single
type respectively.
Usage
Full query example:
curl -XGET 'localhost:9200/twitter/tweet/1/_explain' -d '{
"query" : {
"term" : { "message" : "search" }
}
}'
This will yield the following result:
{
"matches" : true,
"explanation" : {
"value" : 0.15342641,
"description" : "fieldWeight(message:search in 0), product of:",
"details" : [ {
"value" : 1.0,
"description" : "tf(termFreq(message:search)=1)"
}, {
"value" : 0.30685282,
"description" : "idf(docFreq=1, maxDocs=1)"
}, {
"value" : 0.5,
"description" : "fieldNorm(field=message, doc=0)"
} ]
}
}
There is also a simpler way of specifying the query via the q
parameter. The specified q parameter value is then parsed as if the
query_string query was used. Example usage of the q parameter in the
explain api:
curl -XGET 'localhost:9200/twitter/tweet/1/_explain?q=message:search'
This will yield the same result as the previous request.
All parameters:
_source
Set to true to retrieve the _source of the explained document. You can also retrieve part of the document by using _source_include and _source_exclude.
fields
Controls which stored fields to return as part of the document explained.
routing
Controls the routing in the case routing was used during indexing.
parent
Same effect as setting the routing parameter.
preference
Controls on which shard the explain is executed.
source
Allows the data of the request to be put in the query string of the URL.
q
The query string (maps to the query_string query).
df
The default field to use when no field prefix is defined within the query. Defaults to the _all field.
analyzer
The analyzer name to be used when analyzing the query string. Defaults to the analyzer of the _all field.
analyze_wildcard
Should wildcard and prefix queries be analyzed or not. Defaults to false.
lowercase_expanded_terms
Should terms be automatically lowercased or not. Defaults to true.
lenient
If set to true will cause format based failures (like providing text to a numeric field) to be ignored. Defaults to false.
default_operator
The default operator to be used, can be AND or OR. Defaults to OR.
45. Profile API
experimental[]
The Profile API provides detailed timing information about the execution of individual components in a query. It gives the user insight into how queries are executed at a low level so that the user can understand why certain queries are slow, and take steps to improve their slow queries.
The output from the Profile API is very verbose, especially for complicated queries executed across many shards. Pretty-printing the response is recommended to help understand the output.
Note: the details provided by the Profile API directly expose Lucene class names and concepts, which means that complete interpretation of the results requires fairly advanced knowledge of Lucene. This page attempts to give a crash course in how Lucene executes queries so that you can use the Profile API to successfully diagnose and debug queries, but it is only an overview. For complete understanding, please refer to Lucene's documentation and, in places, the code. That said, a complete understanding is often not required to fix a slow query. It is usually sufficient to see that a particular component of a query is slow, without necessarily understanding why.
Usage
Any _search request can be profiled by adding a top-level profile parameter:
curl -XGET 'localhost:9200/_search' -d '{
"profile": true,
"query" : {
"match" : { "message" : "search test" }
}
}'
Setting the top-level profile parameter to true enables profiling for the search.
This will yield the following result:
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [ ... ]
},
"profile": {
"shards": [
{
"id": "[htuC6YnSSSmKFq5UBt0YMA][test][0]",
"searches": [
{
"query": [
{
"query_type": "BooleanQuery",
"lucene": "message:search message:test",
"time": "15.52889800ms",
"breakdown": {
"score": 0,
"next_doc": 24495,
"match": 0,
"create_weight": 8488388,
"build_scorer": 7016015,
"advance": 0
},
"children": [
{
"query_type": "TermQuery",
"lucene": "message:search",
"time": "4.938855000ms",
"breakdown": {
"score": 0,
"next_doc": 18332,
"match": 0,
"create_weight": 2945570,
"build_scorer": 1974953,
"advance": 0
}
},
{
"query_type": "TermQuery",
"lucene": "message:test",
"time": "0.5016660000ms",
"breakdown": {
"score": 0,
"next_doc": 0,
"match": 0,
"create_weight": 170534,
"build_scorer": 331132,
"advance": 0
}
}
]
}
],
"rewrite_time": 185002,
"collector": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "2.206529000ms"
}
]
}
]
}
]
}
}
Search results are returned, but were omitted here for brevity.
Even for a simple query, the response is relatively complicated. Let’s break it down piece-by-piece before moving to more complex examples.
First, the overall structure of the profile response is as follows:
{
"profile": {
"shards": [
{
"id": "[htuC6YnSSSmKFq5UBt0YMA][test][0]",
"searches": [
{
"query": [...],
"rewrite_time": 185002,
"collector": [...]
}
]
}
]
}
}
A profile is returned for each shard that participated in the response, and is identified by a unique ID.
Each profile contains a section which holds details about the query execution.
Each profile has a single time representing the cumulative rewrite time.
Each profile also contains a section about the Lucene Collectors which run the search.
Because a search request may be executed against one or more shards in an index, and a search may cover
one or more indices, the top-level element in the profile response is an array of shard objects.
Each shard object lists its id, which uniquely identifies the shard. The ID's format is
[nodeID][indexName][shardID].
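For scripted post-processing of a profile response, the shard id string can be split back into its three parts. A small sketch (the function name parse_shard_id is an assumption):

```python
import re

def parse_shard_id(shard_id):
    """Split a profile shard id of the form [nodeID][indexName][shardID]
    into its (node, index, shard-number) components."""
    m = re.fullmatch(r"\[([^\]]+)\]\[([^\]]+)\]\[(\d+)\]", shard_id)
    if m is None:
        raise ValueError("unrecognized shard id: %r" % shard_id)
    node, index, shard = m.groups()
    return node, index, int(shard)
```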
The profile itself may consist of one or more "searches", where a search is a query executed against the underlying
Lucene index. Most Search Requests submitted by the user will only execute a single search against the Lucene index.
But occasionally multiple searches will be executed, such as including a global aggregation (which needs to execute
a secondary "match_all" query for the global context).
Inside each search object there will be two arrays of profiled information:
a query array and a collector array. In the future, more sections may be added, such as suggest, highlight,
aggregations, etc.
There will also be a rewrite metric showing the total time spent rewriting the query (in nanoseconds).
45.1. query Section
The query section contains detailed timing of the query tree executed by Lucene on a particular shard.
The overall structure of this query tree will resemble your original Elasticsearch query, but may be slightly
(or sometimes very) different. It will also use similar but not always identical naming. Using our previous
term query example, let’s analyze the query section:
"query": [
{
"query_type": "BooleanQuery",
"lucene": "message:search message:test",
"time": "15.52889800ms",
"breakdown": {...},
"children": [
{
"query_type": "TermQuery",
"lucene": "message:search",
"time": "4.938855000ms",
"breakdown": {...}
},
{
"query_type": "TermQuery",
"lucene": "message:test",
"time": "0.5016660000ms",
"breakdown": {...}
}
]
}
]
The breakdown timings are omitted for simplicity.
Based on the profile structure, we can see that our match query was rewritten by Lucene into a BooleanQuery with two
clauses (both holding a TermQuery). The query_type field displays the Lucene class name, and often aligns with
the equivalent name in Elasticsearch. The lucene field displays the Lucene explanation text for the query, and
is made available to help differentiate between parts of your query (e.g. both message:search and message:test
are TermQuerys and would otherwise appear identical).
The time field shows that this query took ~15ms for the entire BooleanQuery to execute. The recorded time is inclusive
of all children.
The "breakdown" field will give detailed stats about how the time was spent, we’ll look at
that in a moment. Finally, the "children" array lists any sub-queries that may be present. Because we searched for two
values ("search test"), our BooleanQuery holds two children TermQueries. They have identical information (query_type, time,
breakdown, etc). Children are allowed to have their own children.
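Because every node's time is inclusive of its children, finding the genuinely slow component means subtracting the children's totals. A sketch that ranks the nodes of a profiled query array by their own (self) time; the helper names self_time_ms and hotspots are assumptions:

```python
def self_time_ms(node):
    """Time spent in this query node itself, excluding its children
    (the reported "time" string is inclusive of all children)."""
    total = float(node["time"][:-2])  # strip the trailing "ms"
    children = node.get("children", [])
    return total - sum(float(c["time"][:-2]) for c in children)

def hotspots(query_profile):
    """Flatten a profiled "query" array and rank nodes by self time."""
    ranked = []
    def walk(node):
        ranked.append((node["lucene"], self_time_ms(node)))
        for child in node.get("children", []):
            walk(child)
    for root in query_profile:
        walk(root)
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Applied to the example above, the BooleanQuery's ~15ms shrinks to ~10ms of self time once its two TermQuery children are subtracted.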
45.1.1. Timing Breakdown
The breakdown component lists detailed timing statistics about low-level Lucene execution:
"breakdown": {
"score": 0,
"next_doc": 24495,
"match": 0,
"create_weight": 8488388,
"build_scorer": 7016015,
"advance": 0
}
Timings are listed in wall-clock nanoseconds and are not normalized at all. All caveats about the overall
time apply here. The intention of the breakdown is to give you a feel for A) what machinery in Lucene is
actually eating time, and B) the magnitude of differences in times between the various components. Like the overall time,
the breakdown is inclusive of all children times.
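Since the breakdown values are plain nanosecond counts, turning them into relative shares takes only a few lines. An illustrative sketch using the breakdown from the example above (the helper name dominant_phase is an assumption):

```python
def dominant_phase(breakdown):
    """Return (phase, fraction_of_total) for the costliest breakdown entry."""
    total = sum(breakdown.values())
    phase = max(breakdown, key=breakdown.get)
    return phase, breakdown[phase] / total

# Breakdown of the BooleanQuery from the example response (nanoseconds).
breakdown = {
    "score": 0, "next_doc": 24495, "match": 0,
    "create_weight": 8488388, "build_scorer": 7016015, "advance": 0,
}
```

Here create_weight dominates, accounting for a bit more than half of the node's total time.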
The meaning of the stats are as follows:
All parameters:
create_weight
A Query in Lucene must be capable of reuse across multiple IndexSearchers (think of an IndexSearcher as the engine that
executes a search against a specific Lucene index). This puts Lucene in a tricky spot, since many queries
need to accumulate temporary state/statistics associated with the index they are being used against, but the
Query contract mandates that it must be immutable. To get around this, Lucene asks each query to generate a
Weight object, which acts as a temporary context to hold that state. The create_weight metric shows how long
this process takes.
build_scorer
This parameter shows how long it takes to build a Scorer for the query. A Scorer is the mechanism that
iterates over matching documents and generates a score per document (e.g. how well does "foo" match the document?).
Note, this records the time required to generate the Scorer object, not to actually score the documents. Some
queries have faster or slower initialization of the Scorer, depending on optimizations, complexity, etc.
next_doc
The Lucene method next_doc returns the Doc ID of the next document matching the query. This statistic shows
the time it takes to determine which document is the next match.
advance
advance is a "lower level" version of next_doc: it also finds the next matching document, but requires the
calling query to perform extra tasks, such as identifying skips. Some queries, such as conjunctions (must
clauses in Boolean queries), cannot use next_doc; for those queries, advance is timed instead.
match
Some queries, such as phrase queries, match documents using a "two phase" process. First, the document is
"approximately" matched, and if it matches approximately, it is checked a second time with a more rigorous
(and expensive) process. The second-phase verification is what the match statistic measures.
score
This records the time taken to score a particular document via its Scorer.
45.2. collectors Section
The Collectors portion of the response shows high-level execution details. Lucene works by defining a "Collector" which is responsible for coordinating the traversal, scoring and collection of matching documents. Collectors are also how a single query can record aggregation results, execute unscoped "global" queries, execute post-query filters, etc.
Looking at the previous example:
"collector": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "2.206529000ms"
}
]
We see a single collector named SimpleTopScoreDocCollector. This is the default "scoring and sorting" Collector
used by Elasticsearch. The reason field attempts to give a plain English description of the class name. The
time is similar to the time in the query tree: a wall-clock time inclusive of all children. Similarly, children lists
all sub-collectors.
It should be noted that Collector times are independent from the Query times. They are calculated, combined and normalized independently! Due to the nature of Lucene’s execution, it is impossible to "merge" the times from the Collectors into the Query section, so they are displayed in separate portions.
For reference, the various collector reasons are:
search_sorted
A collector that scores and sorts documents. This is the most common collector and will be seen in most simple searches.
search_count
A collector that only counts the number of documents that match the query, but does not fetch the source. This is seen when size: 0 is specified.
search_terminate_after_count
A collector that terminates search execution after n matching documents have been collected, where n is the value of the terminate_after parameter.
search_min_score
A collector that only returns matching documents that have a score greater than the specified min_score.
search_multi
A collector that wraps several other collectors. This is seen when combinations of search, aggregations, global aggs and post_filters are combined in a single search.
search_timeout
A collector that halts execution after a specified period of time. This is seen when a timeout has been specified on the search request.
aggregation
A collector that Elasticsearch uses to run aggregations against the query scope. A single aggregation collector is used to collect documents for all aggregations, so you will see a list of aggregations in the name rather than a single one.
global_aggregation
A collector that executes an aggregation against the global query scope, rather than the specified query. Because the global scope is necessarily different from the executed query, it must execute its own match_all query (which you will see added to the Query section) to collect your entire dataset.
45.3. rewrite Section
All queries in Lucene undergo a "rewriting" process. A query (and its sub-queries) may be rewritten one or more times, and the process continues until the query stops changing. This process allows Lucene to perform optimizations, such as removing redundant clauses or replacing one query with a more efficient execution path. For example a Boolean → Boolean → TermQuery can be rewritten to a TermQuery, because all the Booleans are unnecessary in this case.
The rewriting process is complex and difficult to display, since queries can change drastically. Rather than showing the intermediate results, the total rewrite time is simply displayed as a value (in nanoseconds). This value is cumulative and contains the total time for all queries being rewritten.
45.4. A more complex example
To demonstrate a slightly more complex query and the associated results, we can profile the following query:
GET /test/_search
{
"profile": true,
"query": {
"term": {
"message": {
"value": "search"
}
}
},
"aggs": {
"non_global_term": {
"terms": {
"field": "agg"
},
"aggs": {
"second_term": {
"terms": {
"field": "sub_agg"
}
}
}
},
"another_agg": {
"cardinality": {
"field": "aggB"
}
},
"global_agg": {
"global": {},
"aggs": {
"my_agg2": {
"terms": {
"field": "globalAgg"
}
}
}
}
},
"post_filter": {
"term": {
"my_field": "foo"
}
}
}
This example has:
-
A query
-
A scoped aggregation
-
A global aggregation
-
A post_filter
And the response:
{
"profile": {
"shards": [
{
"id": "[P6-vulHtQRWuD4YnubWb7A][test][0]",
"searches": [
{
"query": [
{
"query_type": "TermQuery",
"lucene": "my_field:foo",
"time": "0.4094560000ms",
"breakdown": {
"score": 0,
"next_doc": 0,
"match": 0,
"create_weight": 31584,
"build_scorer": 377872,
"advance": 0
}
},
{
"query_type": "TermQuery",
"lucene": "message:search",
"time": "0.3037020000ms",
"breakdown": {
"score": 0,
"next_doc": 5936,
"match": 0,
"create_weight": 185215,
"build_scorer": 112551,
"advance": 0
}
}
],
"rewrite_time": 7208,
"collector": [
{
"name": "MultiCollector",
"reason": "search_multi",
"time": "1.378943000ms",
"children": [
{
"name": "FilteredCollector",
"reason": "search_post_filter",
"time": "0.4036590000ms",
"children": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "0.006391000000ms"
}
]
},
{
"name": "BucketCollector: [[non_global_term, another_agg]]",
"reason": "aggregation",
"time": "0.9546020000ms"
}
]
}
]
},
{
"query": [
{
"query_type": "MatchAllDocsQuery",
"lucene": "*:*",
"time": "0.04829300000ms",
"breakdown": {
"score": 0,
"next_doc": 3672,
"match": 0,
"create_weight": 6311,
"build_scorer": 38310,
"advance": 0
}
}
],
"rewrite_time": 1067,
"collector": [
{
"name": "GlobalAggregator: [global_agg]",
"reason": "aggregation_global",
"time": "0.1226310000ms"
}
]
}
]
}
]
}
}
As you can see, the output is significantly more verbose than before. All the major portions of the query are represented:
-
The first TermQuery (message:search) represents the main term query
-
The second TermQuery (my_field:foo) represents the post_filter query
-
There is a MatchAllDocsQuery (*:*) which is being executed as a second, distinct search. This was not part of the query specified by the user, but is auto-generated by the global aggregation to provide a global query scope
The Collector tree is fairly straightforward, showing how a single MultiCollector wraps a FilteredCollector to execute the post_filter (which in turn wraps the normal scoring SimpleTopScoreDocCollector) and a BucketCollector to run all scoped aggregations. In the MatchAll search, there is a single GlobalAggregator to run the global aggregation.
45.5. Performance Notes
Like any profiler, the Profile API introduces non-negligible overhead to query execution. The act of instrumenting
low-level method calls such as advance and next_doc can be fairly expensive, since these methods are called
in tight loops. Therefore, profiling should not be enabled by default in production settings, and profiled
query times should not be compared against non-profiled ones. Profiling is just a diagnostic tool.
There are also cases where special Lucene optimizations are disabled, since they are not amenable to profiling. This could cause some queries to report larger relative times than their non-profiled counterparts, but in general should not have a drastic effect compared to other components in the profiled query.
45.6. Limitations
-
Profiling statistics are currently not available for suggestions, highlighting, or
dfs_query_then_fetch
Detailed breakdown for aggregations is not currently available past the high-level overview provided from the Collectors
-
The Profiler is still highly experimental. The Profiler is instrumenting parts of Lucene that were never designed to be exposed in this manner, and so all results should be viewed as a best effort to provide detailed diagnostics. We hope to improve this over time. If you find obviously wrong numbers, strange query structures or other bugs, please report them!
45.7. Understanding MultiTermQuery output
A special note needs to be made about the MultiTermQuery class of queries. This includes wildcards, regex and fuzzy
queries. These queries emit very verbose responses, and are not overly structured.
Essentially, these queries rewrite themselves on a per-segment basis. If you imagine the wildcard query b*, it technically
can match any token that begins with the letter "b". It would be impossible to enumerate all possible combinations,
so Lucene rewrites the query in context of the segment being evaluated. E.g. one segment may contain the tokens
[bar, baz], so the query rewrites to a BooleanQuery combination of "bar" and "baz". Another segment may only have the
token [bakery], so query rewrites to a single TermQuery for "bakery".
Due to this dynamic, per-segment rewriting, the clean tree structure becomes distorted and no longer follows a clean "lineage" showing how one query rewrites into the next. At present, all we can do is apologize, and suggest you collapse the details for that query’s children if they are too confusing. Luckily, all the timing statistics are correct, just not the physical layout in the response, so it is sufficient to analyze the top-level MultiTermQuery and ignore its children if you find the details too tricky to interpret.
Hopefully this will be fixed in future iterations, but it is a tricky problem to solve and still in-progress :)
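The per-segment rewrite described above can be illustrated with a small sketch (a hypothetical helper, not Lucene's actual API; segment contents are invented):

```python
# Sketch of how a wildcard query such as b* is rewritten per segment:
# each segment enumerates only the terms it actually contains, so the
# same logical query becomes a different concrete query in each segment.

def rewrite_wildcard(prefix, segment_terms):
    """Rewrite a prefix wildcard against one segment's term dictionary."""
    matches = sorted(t for t in segment_terms if t.startswith(prefix))
    if len(matches) == 1:
        return ("TermQuery", matches[0])
    return ("BooleanQuery", matches)

segment_1 = {"bar", "baz", "foo"}
segment_2 = {"bakery", "foo"}

print(rewrite_wildcard("b", segment_1))  # ('BooleanQuery', ['bar', 'baz'])
print(rewrite_wildcard("b", segment_2))  # ('TermQuery', 'bakery')
```

Because each segment produces its own rewritten query, the profiled tree shows one set of children per segment rather than a single clean lineage.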
46. Percolator
|
|
Percolating geo-queries in Elasticsearch 2.2.0 or later: the new geo_point field format is not supported by the percolator's in-memory index. See Percolating geo-queries in Elasticsearch 2.2.0 and later for a workaround. |
Traditionally you design documents based on your data, store them into an index, and then define queries via the search API in order to retrieve these documents. The percolator works in the opposite direction. First you store queries into an index and then, via the percolate API, you define documents in order to retrieve these queries.
The reason that queries can be stored comes from the fact that in Elasticsearch both documents and queries are defined in JSON. This allows you to embed queries into documents via the index API. Elasticsearch can extract the query from a document and make it available to the percolate API. Since documents are also defined as JSON, you can define a document in a request to the percolate API.
The percolator and most of its features work in realtime, so once a percolate query is indexed it can immediately be used in the percolate API.
|
|
Fields referred to in a percolator query must already exist in the mapping associated with the index used for percolation. There are two ways to make sure that a field mapping exists:
|
Sample Usage
Create an index with a mapping for the field message:
curl -XPUT 'localhost:9200/my-index' -d '{
"mappings": {
"my-type": {
"properties": {
"message": {
"type": "string"
}
}
}
}
}'
Register a query in the percolator:
curl -XPUT 'localhost:9200/my-index/.percolator/1' -d '{
"query" : {
"match" : {
"message" : "bonsai tree"
}
}
}'
Match a document to the registered percolator queries:
curl -XGET 'localhost:9200/my-index/my-type/_percolate' -d '{
"doc" : {
"message" : "A new bonsai tree in the office"
}
}'
The above request will yield the following response:
{
"took" : 19,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"total" : 1,
"matches" : [
{
"_index" : "my-index",
"_id" : "1"
}
]
}
The percolate query with id 1 matches our document.
Indexing Percolator Queries
Percolate queries are stored as documents in a specific format and in an arbitrary index under a reserved type with the
name .percolator. The query itself is placed as is in a JSON object under the top level field query.
{
"query" : {
"match" : {
"field" : "value"
}
}
}
Since this is just an ordinary document, any field can be added to this document. This can be useful later on to only percolate documents by specific queries.
{
"query" : {
"match" : {
"field" : "value"
}
},
"priority" : "high"
}
On top of this, a mapping type can also be associated with the query. This controls how queries such as range queries, shape filters, and other queries and filters that rely on mapping settings get constructed. This is important since the percolate queries are indexed into the .percolator type, and queries / filters that rely on mapping settings could otherwise yield unexpected behaviour. Note: by default, field names are resolved in a smart manner, but in certain cases with multiple types this can lead to unexpected behavior, so being explicit about it will help.
{
"query" : {
"range" : {
"created_at" : {
"gte" : "2010-01-01T00:00:00",
"lte" : "2011-01-01T00:00:00"
}
}
},
"type" : "tweet",
"priority" : "high"
}
In the above example the range query really gets parsed into a Lucene numeric range query, based on the settings for
the field created_at in the type tweet.
Just as with any other type, the .percolator type has a mapping, which you can configure via the mappings APIs.
The default percolator mapping doesn’t index the query field, it only stores it. By default the following mapping is active:
{
".percolator" : {
"properties" : {
"query" : {
"type" : "object",
"enabled" : false
}
}
}
}
If needed, this mapping can be modified with the update mapping API.
In order to un-register a percolate query the delete API can be used. So if the previously added query needs to be deleted, the following delete request needs to be executed:
curl -XDELETE localhost:9200/my-index/.percolator/1
Percolate API
The percolate API executes in a distributed manner, meaning it executes on all shards an index points to.
-
index - The index that contains the .percolator type. This can also be an alias.
-
type - The type of the document to be percolated. The mapping of that type is used to parse the document.
-
doc - The actual document to percolate. Unlike the other two options this needs to be specified in the request body. Note: this isn’t required when percolating an existing document.
curl -XGET 'localhost:9200/twitter/tweet/_percolate' -d '{
"doc" : {
"created_at" : "2010-10-10T00:00:00",
"message" : "some text"
}
}'
-
routing - In case the percolate queries are partitioned by a custom routing value, this option makes sure that the percolate request only gets executed on the shard that the routing value is partitioned to. This means that the percolate request only gets executed on one shard instead of all shards. Multiple values can be specified as a comma separated string, in which case the request can be executed on more than one shard.
-
preference - Controls which shard replicas are preferred to execute the request on. Works the same as in the search API.
-
ignore_unavailable - Controls if missing concrete indices should silently be ignored. Same as in the search API.
-
percolate_format - If ids is specified then the matches array in the percolate response will contain a string array of the matching ids instead of an array of objects. This can be useful to reduce the amount of data being sent back to the client. Obviously if there are two percolator queries with the same id from different indices there is no way to find out which percolator query belongs to which index. Any other value for percolate_format will be ignored.
-
filter - Reduces the number of queries to execute during percolating. Only the percolator queries that match the filter will be included in the percolate execution. The filter option works in near realtime, so a refresh needs to have occurred for the filter to include the latest percolate queries.
-
query - Same as the filter option, but the score is also computed. The computed scores can then be used by the track_scores and sort options.
-
size - Defines the maximum number of matches (percolate queries) to be returned. Defaults to unlimited.
-
track_scores - Whether the _score is included for each match. The _score is based on the query and represents how the query matched the percolate query’s metadata, not how the document (that is being percolated) matched the query. The query option is required for this option. Defaults to false.
-
sort - Define a sort specification like in the search API. Currently only sorting on _score reverse (default relevancy) is supported. Other sort fields will throw an exception. The size and query options are required for this setting. Like track_scores, the score is based on the query and represents how the query matched the percolate query’s metadata, not how the document being percolated matched the query.
-
aggs - Allows aggregation definitions to be included. The aggregations are based on the matching percolator queries; look at the aggregation documentation on how to define aggregations.
-
highlight - Allows highlight definitions to be included. The document being percolated is highlighted for each matching query. This allows you to see how each match highlights the document being percolated. See the highlight documentation on how to define highlights. The size option is required for highlighting; the performance of highlighting in the percolate API depends on how many matches are being highlighted.
Dedicated Percolator Index
Percolate queries can be added to any index. Instead of adding percolate queries to the index the data resides in, these queries can also be added to a dedicated index. The advantage of this is that the dedicated percolator index can have its own index settings (for example, the number of primary and replica shards). If you choose to have a dedicated percolate index, you need to make sure that the mappings from the normal index are also available on the percolate index. Otherwise percolate queries can be parsed incorrectly.
Filtering Executed Queries
Filtering allows you to reduce the number of queries executed. Any filter that the search API supports (except the ones mentioned in Important Notes) can also be used in the percolate API. The filter only works on the metadata fields; the query field isn’t indexed by default. Based on the query we indexed before, the following filter can be defined:
curl -XGET localhost:9200/test/type1/_percolate -d '{
"doc" : {
"field" : "value"
},
"filter" : {
"term" : {
"priority" : "high"
}
}
}'
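Conceptually, the filter prunes the set of registered queries before any of them are executed against the document. A minimal sketch (hypothetical in-memory representation, toy term matching):

```python
# Sketch: metadata fields stored alongside a percolator query let the
# filter option narrow down which queries are even executed.

queries = {
    "1": {"terms": {"bonsai", "tree"}, "priority": "high"},
    "2": {"terms": {"oak"},            "priority": "low"},
}

def percolate(doc_tokens, priority=None):
    matches = []
    for qid, q in queries.items():
        if priority is not None and q["priority"] != priority:
            continue  # filtered out before the query is executed
        if q["terms"] & doc_tokens:
            matches.append(qid)
    return matches

tokens = {"a", "bonsai", "tree", "and", "an", "oak"}
print(percolate(tokens))                    # ['1', '2']
print(percolate(tokens, priority="high"))   # ['1']
```

Fewer executed queries is also why the filter option reduces percolation time, as noted later in How it Works Under the Hood.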
Percolator Count API
The count percolate API only keeps track of the number of matches and doesn’t keep track of the actual matches. Example:
curl -XGET 'localhost:9200/my-index/my-type/_percolate/count' -d '{
"doc" : {
"message" : "some message"
}
}'
Response:
{
... // header
"total" : 3
}
Percolating an Existing Document
In order to percolate a newly indexed document, the percolate existing document API can be used. Based on the response from an index request, the _id and other meta information can be used to immediately percolate the newly added document.
-
id - The id of the document to retrieve the source for.
-
percolate_index - The index containing the percolate queries. Defaults to the index defined in the url.
-
percolate_type - The percolate type (used for parsing the document). Defaults to the type defined in the url.
-
routing - The routing value to use when retrieving the document to percolate.
-
preference - Which shard to prefer when retrieving the existing document.
-
percolate_routing - The routing value to use when percolating the existing document.
-
percolate_preference - Which shard to prefer when executing the percolate request.
-
version - Enables a version check. If the fetched document’s version isn’t equal to the specified version then the request fails with a version conflict and the percolation request is aborted.
Internally the percolate API will issue a GET request for fetching the _source of the document to percolate.
For this feature to work, the _source for documents to be percolated needs to be stored.
Example
Index response:
{
"_index" : "my-index",
"_type" : "message",
"_id" : "1",
"_version" : 1,
"created" : true
}
Percolating an Existing Document:
curl -XGET 'localhost:9200/my-index/message/1/_percolate'
The response is the same as with the regular percolate API.
Multi Percolate API
The multi percolate API allows you to bundle multiple percolate requests into a single request, similar to what the multi search API does to search requests. The request body format is line based. Each percolate request item takes two lines: the first line is the header and the second line is the body.
The header can contain any parameter that normally would be set via the request path or query string parameters. There are several percolate actions, because there are multiple types of percolate requests.
-
percolate - Action for defining a regular percolate request.
-
count - Action for defining a count percolate request.
Depending on the percolate action different parameters can be specified. For example the percolate and percolate existing document actions support different parameters.
-
GET|POST /[index]/[type]/_mpercolate
-
GET|POST /[index]/_mpercolate
-
GET|POST /_mpercolate
The index and type defined in the url path are the default index and type.
Example
Request:
curl -XGET 'localhost:9200/twitter/tweet/_mpercolate' --data-binary "@requests.txt"; echo
The index twitter is the default index, and the type tweet is the default type; they will be used in case a header doesn’t specify an index or type.
requests.txt:
{"percolate" : {"index" : "twitter", "type" : "tweet"}}
{"doc" : {"message" : "some text"}}
{"percolate" : {"index" : "twitter", "type" : "tweet", "id" : "1"}}
{}
{"percolate" : {"index" : "users", "type" : "user", "id" : "3", "percolate_index" : "users_2012" }}
{"size" : 10}
{"count" : {"index" : "twitter", "type" : "tweet"}}
{"doc" : {"message" : "some other text"}}
{"count" : {"index" : "twitter", "type" : "tweet", "id" : "1"}}
{}
For a percolate existing document item (headers with the id field), the body can be an empty JSON object.
All the required options are set in the header.
Response:
{
"responses" : [
{
"took" : 24,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"total" : 3,
"matches" : [
{
"_index": "twitter",
"_id": "1"
},
{
"_index": "twitter",
"_id": "2"
},
{
"_index": "twitter",
"_id": "3"
}
]
},
{
"took" : 12,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"total" : 3,
"matches" : [
{
"_index": "twitter",
"_id": "4"
},
{
"_index": "twitter",
"_id": "5"
},
{
"_index": "twitter",
"_id": "6"
}
]
},
{
"error" : "DocumentMissingException[[_na][_na] [user][3]: document missing]"
},
{
"took" : 12,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"total" : 3
},
{
"took" : 14,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"total" : 3
}
]
}
Each item represents a percolate response; the order of the items maps to the order in which the percolate requests were specified. In case a percolate request failed, the item response is substituted with an error message.
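The line-based request body can be assembled programmatically. A small sketch (a hypothetical helper, assuming the header/body pairing described above):

```python
import json

# Sketch: the multi percolate body is line-based; each item is a header
# line followed by a body line (an empty JSON object for "percolate
# existing document" items, whose options all live in the header).

def mpercolate_body(items):
    lines = []
    for header, body in items:
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"

body = mpercolate_body([
    ({"percolate": {"index": "twitter", "type": "tweet"}},
     {"doc": {"message": "some text"}}),
    ({"percolate": {"index": "twitter", "type": "tweet", "id": "1"}},
     {}),
])
print(body)
```

The result is suitable for sending with `--data-binary`, which preserves the newlines the API relies on.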
How it Works Under the Hood
When indexing a document that contains a query into an index under the .percolator type, the query part of the document gets
parsed into a Lucene query and is kept in memory until that percolator document is removed or the index containing the
.percolator type gets removed. So, all the active percolator queries are kept in memory.
At percolate time, the document specified in the request gets parsed into a Lucene document and is stored in an in-memory Lucene index. This in-memory index can hold just this one document and is optimized for that. Then all the queries that are registered to the index that the percolate request targets are executed on this single-document in-memory index. This happens on each shard the percolate request needs to execute on.
By using the routing, filter or query features, the number of queries that need to be executed can be reduced and thus
the time the percolate API needs to run can be decreased.
Important Notes
Because the percolator API processes one document at a time, it doesn’t support queries and filters that run
against child documents such as has_child and has_parent.
The inner_hits feature on the nested query isn’t supported in the percolate API.
The wildcard and regexp queries natively use a lot of memory and, because the percolator keeps the queries in memory,
this can easily use up the available heap space. If possible, try to use a prefix query or ngramming to
achieve the same result (with far less memory being used).
The delete-by-query plugin doesn’t work to unregister a query; it only deletes the percolate documents from disk. In order
to update the registered queries in memory, the index needs to be closed and opened.
Forcing Unmapped Fields to be Handled as Strings
In certain cases it is unknown what kind of percolator queries will be registered, and if no field mapping exists for fields
that are referred to by percolator queries, adding a percolator query fails. This means the mapping needs to be updated
to have the field with the appropriate settings before the percolator query can be added. But sometimes it is sufficient
if all unmapped fields are handled as if they were default string fields. In those cases one can set the
index.percolator.map_unmapped_fields_as_string setting to true (defaults to false); then, if a field referred to in
a percolator query does not exist, it will be handled as a default string field so that adding the percolator query doesn’t
fail.
Percolating geo-queries in Elasticsearch 2.2.0 and later
The new geo_point fields added in Elasticsearch 2.2.0 and
above require that doc_values are enabled in order to
function. Unfortunately, the in-memory index used by the percolator does not
yet have support for doc_values, meaning that geo-queries
will not work in a percolator index created in Elasticsearch 2.2.0 or later.
A workaround exists which allows you to both benefit from the new geo_point
field format when searching or aggregating, and to use geo-queries in the
percolator.
Documents with `geo_point` fields should be indexed into a new index created in Elasticsearch 2.2.0 or later for searching or aggregations:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
PUT my_index/my_type/1
{
"location": {
"lat": 0,
"lon": 0
}
}
Percolator queries should be created in a separate dedicated percolator index, which claims to have been created in Elasticsearch 2.1.0:
PUT my_percolator_index
{
"settings": {
"index.version.created": 2010299
},
"mappings": {
"my_type": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
PUT my_percolator_index/.percolator/my_geo_query
{
"query": {
"geo_distance": {
"distance": "5km",
"location": {
"lat": 0,
"lon": 0
}
}
}
}
The index.version.created setting of 2010299 marks the index as having been created in Elasticsearch 2.1.0.
With this setup, you can percolate an inline document as follows:
GET my_percolator_index/my_type/_percolate
{
"doc": {
"location": {
"lat": 0,
"lon": 0
}
}
}
or percolate a document already indexed in my_index as follows:
GET my_index/my_type/1/_percolate?percolate_index=my_percolator_index
47. Field stats API
experimental[]
The field stats api allows one to find statistical properties of a field without executing a search, by looking up measurements that are natively available in the Lucene index. This can be useful to explore a dataset which you don’t know much about. For example, this allows creating a histogram aggregation with meaningful intervals based on the min/max range of values.
The field stats api by default executes on all indices, but can execute on specific indices too.
All indices:
curl -XGET "http://localhost:9200/_field_stats?fields=rating"
Specific indices:
curl -XGET "http://localhost:9200/index1,index2/_field_stats?fields=rating"
Supported request options:
fields
|
A list of fields to compute stats for. |
level
|
Defines if field stats should be returned on a per index level or on a
cluster wide level. Valid values are indices and cluster. |
Alternatively the fields option can also be defined in the request body:
curl -XPOST "http://localhost:9200/_field_stats?level=indices" -d '{
"fields" : ["rating"]
}'
This is equivalent to the previous request.
Field statistics
The field stats api is supported on string based, number based and date based fields and can return the following statistics per field:
max_doc
|
The total number of documents. |
doc_count
|
The number of documents that have at least one term for this field, or -1 if this measurement isn’t available on one or more shards. |
density
|
The percentage of documents that have at least one value for this field. This
is a derived statistic and is based on the doc_count and max_doc statistics. |
sum_doc_freq
|
The sum of each term’s document frequency in this field, or -1 if this measurement isn’t available on one or more shards. Document frequency is the number of documents containing a particular term. |
sum_total_term_freq
|
The sum of the term frequencies of all terms in this field across all documents, or -1 if this measurement isn’t available on one or more shards. Term frequency is the total number of occurrences of a term in a particular document and field. |
min_value
|
The lowest value in the field. |
min_value_as_string
|
The lowest value in the field represented in a displayable form. All fields except string fields return this, since string fields already represent their values as strings. |
max_value
|
The highest value in the field. |
max_value_as_string
|
The highest value in the field represented in a displayable form. All fields except string fields return this, since string fields already represent their values as strings. |
|
|
Documents marked as deleted (but not yet removed by the merge process) still affect all the mentioned statistics. |
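As a sanity check of how the derived density statistic relates to doc_count and max_doc, here is a sketch using the numbers from the example responses below (assuming the percentage is truncated to an integer, which matches all four fields in the examples):

```python
# Sketch: density as a derived statistic.  doc_count of -1 means the
# measurement is unavailable on one or more shards, so density is
# unavailable too.

def density(doc_count, max_doc):
    if doc_count < 0:
        return -1
    return int(100 * doc_count / max_doc)

print(density(564633, 1326564))  # 42, as in the creation_date example
print(density(-1, 1326564))      # -1
```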
Cluster level field statistics example
Request:
curl -XGET "http://localhost:9200/_field_stats?fields=rating,answer_count,creation_date,display_name"
Response:
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"indices": {
"_all": {
"fields": {
"creation_date": {
"max_doc": 1326564,
"doc_count": 564633,
"density": 42,
"sum_doc_freq": 2258532,
"sum_total_term_freq": -1,
"min_value": "2008-08-01T16:37:51.513Z",
"max_value": "2013-06-02T03:23:11.593Z"
},
"display_name": {
"max_doc": 1326564,
"doc_count": 126741,
"density": 9,
"sum_doc_freq": 166535,
"sum_total_term_freq": 166616,
"min_value": "0",
"max_value": "정혜선"
},
"answer_count": {
"max_doc": 1326564,
"doc_count": 139885,
"density": 10,
"sum_doc_freq": 559540,
"sum_total_term_freq": -1,
"min_value": 0,
"max_value": 160
},
"rating": {
"max_doc": 1326564,
"doc_count": 437892,
"density": 33,
"sum_doc_freq": 1751568,
"sum_total_term_freq": -1,
"min_value": -14,
"max_value": 1277
}
}
}
}
}
The _all key indicates that it contains the field stats of all indices in the cluster.
Indices level field statistics example
Request:
curl -XGET "http://localhost:9200/_field_stats?fields=rating,answer_count,creation_date,display_name&level=indices"
Response:
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"indices": {
"stack": {
"fields": {
"creation_date": {
"max_doc": 1326564,
"doc_count": 564633,
"density": 42,
"sum_doc_freq": 2258532,
"sum_total_term_freq": -1,
"min_value": "2008-08-01T16:37:51.513Z",
"max_value": "2013-06-02T03:23:11.593Z"
},
"display_name": {
"max_doc": 1326564,
"doc_count": 126741,
"density": 9,
"sum_doc_freq": 166535,
"sum_total_term_freq": 166616,
"min_value": "0",
"max_value": "정혜선"
},
"answer_count": {
"max_doc": 1326564,
"doc_count": 139885,
"density": 10,
"sum_doc_freq": 559540,
"sum_total_term_freq": -1,
"min_value": 0,
"max_value": 160
},
"rating": {
"max_doc": 1326564,
"doc_count": 437892,
"density": 33,
"sum_doc_freq": 1751568,
"sum_total_term_freq": -1,
"min_value": -14,
"max_value": 1277
}
}
}
}
}
The stack key means it contains all field stats for the stack index.
Field stats index constraints
Field stats index constraints allow you to omit all field stats for indices that don’t match the constraint. An index
constraint can exclude an index's field stats based on the min_value and max_value statistics. This option is only
useful if the level option is set to indices.
For example index constraints can be useful to find out the min and max value of a particular property of your data in
a time based scenario. The following request only returns field stats for the answer_count property for indices
holding questions created in the year 2014:
curl -XPOST "http://localhost:9200/_field_stats?level=indices" -d '{
"fields" : ["answer_count"],
"index_constraints" : {
"creation_date" : {
"min_value" : {
"gte" : "2014-01-01T00:00:00.000Z"
},
"max_value" : {
"lt" : "2015-01-01T00:00:00.000Z"
}
}
}
}'
| The fields to compute and return field stats for. |
| The index constraints to apply. Note that index constraints can be defined for fields that aren’t defined in the fields option. |
| Index constraints for the field creation_date. |
| An index constraint on the min_value property of a field statistic. |
For a field, index constraints can be defined on the min_value statistic, max_value statistic or both.
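The constraint evaluation can be sketched as follows (hypothetical per-index statistics and index names; the real API reads the min/max values from the Lucene index):

```python
from datetime import datetime

# Sketch: an index's field stats are kept only when the chosen
# statistic (min_value / max_value) satisfies every comparison.

stats = {  # hypothetical per-index (min, max) of creation_date
    "questions-2013": ("2013-01-05", "2013-12-30"),
    "questions-2014": ("2014-01-02", "2014-12-29"),
}

def matches(index_min, index_max):
    fmt = "%Y-%m-%d"
    lower = datetime(2014, 1, 1)   # min_value gte 2014-01-01
    upper = datetime(2015, 1, 1)   # max_value lt  2015-01-01
    return (datetime.strptime(index_min, fmt) >= lower and
            datetime.strptime(index_max, fmt) < upper)

kept = [name for name, (lo, hi) in stats.items() if matches(lo, hi)]
print(kept)  # ['questions-2014']
```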
Each index constraint supports the following comparisons:
gte
|
Greater-than or equal to |
gt
|
Greater-than |
lte
|
Less-than or equal to |
lt
|
Less-than |
Field stats index constraints on date fields optionally accept a format option, used to parse the constraint’s value.
If missing, the format configured in the field’s mapping is used.
curl -XPOST "http://localhost:9200/_field_stats?level=indices" -d '{
"fields" : ["answer_count"]
"index_constraints" : {
"creation_date" : {
"min_value" : {
"gte" : "2014-01-01",
"format" : "date_optional_time"
},
"max_value" : {
"lt" : "2015-01-01",
"format" : "date_optional_time"
}
}
}
}'
The format option specifies a custom date format for the constraint values.
Aggregations
The aggregations framework helps provide aggregated data based on a search query. It is based on simple building blocks called aggregations, that can be composed in order to build complex summaries of the data.
An aggregation can be seen as a unit-of-work that builds analytic information over a set of documents. The context of the execution defines what this document set is (e.g. a top-level aggregation executes within the context of the executed query/filters of the search request).
There are many different types of aggregations, each with its own purpose and output. To better understand these types, it is often easier to break them into three main families:
- Bucketing
-
A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion. When the aggregation is executed, all the buckets criteria are evaluated on every document in the context and when a criterion matches, the document is considered to "fall in" the relevant bucket. By the end of the aggregation process, we’ll end up with a list of buckets - each one with a set of documents that "belong" to it.
- Metric
-
Aggregations that keep track and compute metrics over a set of documents.
- Pipeline
-
Aggregations that aggregate the output of other aggregations and their associated metrics
The interesting part comes next. Since each bucket effectively defines a document set (all documents belonging to the bucket), one can potentially associate aggregations on the bucket level, and those will execute within the context of that bucket. This is where the real power of aggregations kicks in: aggregations can be nested!
|
|
Bucketing aggregations can have sub-aggregations (bucketing or metric). The sub-aggregations will be computed for the buckets which their parent aggregation generates. There is no hard limit on the level/depth of nested aggregations (one can nest an aggregation under a "parent" aggregation, which is itself a sub-aggregation of another higher-level aggregation). |
Structuring Aggregations
The following snippet captures the basic structure of aggregations:
"aggregations" : {
"<aggregation_name>" : {
"<aggregation_type>" : {
<aggregation_body>
}
[,"meta" : { [<meta_data_body>] } ]?
[,"aggregations" : { [<sub_aggregation>]+ } ]?
}
[,"<aggregation_name_2>" : { ... } ]*
}
The aggregations object (the key aggs can also be used) in the JSON holds the aggregations to be computed. Each aggregation
is associated with a logical name that the user defines (e.g. if the aggregation computes the average price, then it would
make sense to name it avg_price). These logical names will also be used to uniquely identify the aggregations in the
response. Each aggregation has a specific type (<aggregation_type> in the above snippet) and is typically the first
key within the named aggregation body. Each type of aggregation defines its own body, depending on the nature of the
aggregation (e.g. an avg aggregation on a specific field will define the field on which the average will be calculated).
At the same level of the aggregation type definition, one can optionally define a set of additional aggregations,
though this only makes sense if the aggregation you defined is of a bucketing nature. In this scenario, the
sub-aggregations you define on the bucketing aggregation level will be computed for all the buckets built by the
bucketing aggregation. For example, if you define a set of aggregations under the range aggregation, the
sub-aggregations will be computed for the range buckets that are defined.
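The nesting rules above can be illustrated by building such a request body programmatically (the field and aggregation names here are hypothetical):

```python
import json

# Sketch: a metric sub-aggregation (avg) nested under a bucketing
# range aggregation, so the average is computed once per range bucket.

aggs = {
    "aggregations": {
        "price_ranges": {
            "range": {
                "field": "price",
                "ranges": [{"to": 50}, {"from": 50}],
            },
            "aggregations": {          # sub-aggregations, per bucket
                "avg_price": {"avg": {"field": "price"}},
            },
        }
    }
}

print(json.dumps(aggs, indent=2))
```

Note how the inner "aggregations" object sits at the same level as the aggregation type ("range"), exactly as in the structure snippet above.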
Values Source
Some aggregations work on values extracted from the aggregated documents. Typically, the values will be extracted from
a specific document field which is set using the field key for the aggregations. It is also possible to define a
script which will generate the values (per document).
When both field and script settings are configured for the aggregation, the script will be treated as a
value script. While normal scripts are evaluated on a document level (i.e. the script has access to all the data
associated with the document), value scripts are evaluated on the value level. In this mode, the values are extracted
from the configured field and the script is used to apply a "transformation" over these value/s.
Scripts can generate a single value or multiple values per document. When generating multiple values, one can use the
script_values_sorted settings to indicate whether these values are sorted or not. Internally, Elasticsearch can
perform optimizations when dealing with sorted values (for example, with the min aggregations, knowing the values are
sorted, Elasticsearch will skip the iterations over all the values and rely on the first value in the list to be the
minimum value among all other values associated with the same document).
48. Metrics Aggregations
The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.
Numeric metrics aggregations are a special type of metrics aggregation which output numeric values. Some aggregations output
a single numeric metric (e.g. avg) and are called single-value numeric metrics aggregation, others generate multiple
metrics (e.g. stats) and are called multi-value numeric metrics aggregation. The distinction between single-value and
multi-value numeric metrics aggregations plays a role when these aggregations serve as direct sub-aggregations of some
bucket aggregations (some bucket aggregations enable you to sort the returned buckets based on the numeric metrics in each bucket).
48.1. Avg Aggregation
A single-value metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
Assuming the data consists of documents representing exam grades (between 0 and 100) of students:
{
"aggs" : {
"avg_grade" : { "avg" : { "field" : "grade" } }
}
}
The above aggregation computes the average grade over all documents. The aggregation type is avg and the field setting defines the numeric field of the documents the average will be computed on. The above will return the following:
{
...
"aggregations": {
"avg_grade": {
"value": 75
}
}
}
The name of the aggregation (avg_grade above) also serves as the key by which the aggregation result can be retrieved from the returned response.
48.1.1. Script
Computing the average grade based on a script:
{
...,
"aggs" : {
"avg_grade" : { "avg" : { "script" : "doc['grade'].value" } }
}
}
This will interpret the script parameter as an inline script with the default script language and no script parameters. To use a file script use the following syntax:
{
...,
"aggs" : {
"avg_grade" : {
"avg" : {
"script" : {
"file": "my_script",
"params": {
"field": "grade"
}
}
}
}
}
}
NOTE: For indexed scripts, replace the file parameter with an id parameter.
Value Script
It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new average:
{
"aggs" : {
...
"aggs" : {
"avg_corrected_grade" : {
"avg" : {
"field" : "grade",
"script" : {
"inline": "_value * correction",
"params" : {
"correction" : 1.2
}
}
}
}
}
}
}
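The effect of the value script is plain arithmetic; a minimal Python sketch with made-up grades (this is not how Elasticsearch evaluates scripts, just the math a value script performs):

```python
# Hypothetical grades; Elasticsearch would read these from the grade field.
grades = [60, 55, 70, 65]
correction = 1.2

# A value script runs on each extracted value before it is aggregated:
corrected = [g * correction for g in grades]
avg_corrected = sum(corrected) / len(corrected)
print(avg_corrected)  # 75.0
```

Because the correction factor is linear, this is equivalent to multiplying the uncorrected average by the factor.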
48.1.2. Missing value
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
{
"aggs" : {
"grade_avg" : {
"avg" : {
"field" : "grade",
"missing": 10
}
}
}
}
Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.
48.2. Cardinality Aggregation
A single-value metrics aggregation that calculates an approximate count of
distinct values. Values can be extracted either from specific fields in the
document or generated by a script.
Assume you are indexing books and would like to count the unique authors that match a query:
{
"aggs" : {
"author_count" : {
"cardinality" : {
"field" : "author"
}
}
}
}
48.2.1. Precision control
This aggregation also supports the precision_threshold option:
experimental[The precision_threshold option is specific to the current internal implementation of the cardinality agg, which may change in the future]
{
"aggs" : {
"author_count" : {
"cardinality" : {
"field" : "author_hash",
"precision_threshold": 100
}
}
}
}
The precision_threshold option allows you to trade memory for accuracy, and
defines a unique count below which counts are expected to be close to
accurate. Above this value, counts might become a bit more fuzzy. The maximum
supported value is 40000; thresholds above this number will have the same
effect as a threshold of 40000.
The default value depends on the number of parent aggregations that generate
multiple buckets (such as terms or histograms).
48.2.2. Counts are approximate
Computing exact counts requires loading values into a hash set and returning its size. This doesn’t scale when working on high-cardinality sets and/or large values as the required memory usage and the need to communicate those per-shard sets between nodes would utilize too many resources of the cluster.
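The exact approach this paragraph rules out is simple to state; a Python sketch of the baseline whose memory cost motivates the approximate algorithm (the sample values are made up):

```python
def exact_cardinality(values):
    """Exact distinct count: memory grows with the number of unique values."""
    seen = set()  # this set is what blows up on high-cardinality data
    for v in values:
        seen.add(v)
    return len(seen)

authors = ["kimchy", "s1monw", "kimchy", "martijnvg", "s1monw"]
print(exact_cardinality(authors))  # 3
```

On a distributed index, each shard would additionally have to ship its whole set to the coordinating node, which is the second cost the approximate algorithm avoids.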
This cardinality aggregation is based on the
HyperLogLog++
algorithm, which counts based on the hashes of the values with some interesting
properties:
-
configurable precision, which decides on how to trade memory for accuracy,
-
excellent accuracy on low-cardinality sets,
-
fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.
For a precision threshold of c, the implementation that we are using requires
about c * 8 bytes.
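Under the c * 8 bytes rule of thumb above, the memory budget for a given precision_threshold is easy to estimate (a back-of-the-envelope sketch, not the exact internal accounting):

```python
def hll_memory_bytes(precision_threshold):
    # Clamp to the documented maximum of 40000 before applying c * 8 bytes.
    c = min(precision_threshold, 40000)
    return c * 8

print(hll_memory_bytes(100))    # 800 bytes
print(hll_memory_bytes(40000))  # 320000 bytes, the worst case
print(hll_memory_bytes(90000))  # still 320000: thresholds above 40000 are capped
```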
The following chart shows how the error varies before and after the threshold:

For all 3 thresholds, counts have been accurate up to the configured threshold (although not guaranteed, this is likely to be the case). Please also note that even with a threshold as low as 100, the error remains under 5%, even when counting millions of items.
48.2.3. Pre-computed hashes
On string fields that have a high cardinality, it might be faster to store the
hash of your field values in your index and then run the cardinality aggregation
on this field. This can either be done by providing hash values from the client side
or by letting Elasticsearch compute hash values for you by using the
mapper-murmur3 plugin.
NOTE: Pre-computing hashes is usually only useful on very large and/or high-cardinality fields, as it saves CPU and memory. However, on numeric fields, hashing is very fast and storing the original values requires as much or less memory than storing the hashes. This is also true on low-cardinality string fields, especially given that those have an optimization in order to make sure that hashes are computed at most once per unique value per segment.
48.2.4. Script
The cardinality metric supports scripting, though with a noticeable performance hit,
since hashes need to be computed on the fly:
{
"aggs" : {
"author_count" : {
"cardinality" : {
"script": "doc['author.first_name'].value + ' ' + doc['author.last_name'].value"
}
}
}
}
This will interpret the script parameter as an inline script with the default script language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"author_count" : {
"cardinality" : {
"script" : {
"file": "my_script",
"params": {
"first_name_field": "author.first_name",
"last_name_field": "author.last_name"
}
}
}
}
}
}
NOTE: For indexed scripts, replace the file parameter with an id parameter.
48.2.5. Missing value
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
{
"aggs" : {
"tag_cardinality" : {
"cardinality" : {
"field" : "tag",
"missing": "N/A"
}
}
}
}
Documents without a value in the tag field will fall into the same bucket as documents that have the value N/A.
48.3. Extended Stats Aggregation
A multi-value metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
The extended_stats aggregation is an extended version of the stats aggregation, where additional metrics are added such as sum_of_squares, variance, std_deviation and std_deviation_bounds.
Assuming the data consists of documents representing students' exam grades (between 0 and 100):
{
"aggs" : {
"grades_stats" : { "extended_stats" : { "field" : "grade" } }
}
}
The above aggregation computes the grades statistics over all documents. The aggregation type is extended_stats and the field setting defines the numeric field of the documents the stats will be computed on. The above will return the following:
{
...
"aggregations": {
"grades_stats": {
"count": 9,
"min": 72,
"max": 99,
"avg": 86,
"sum": 774,
"sum_of_squares": 67028,
"variance": 51.55555555555556,
"std_deviation": 7.180219742846005,
"std_deviation_bounds": {
"upper": 100.36043948569201,
"lower": 71.63956051430799
}
}
}
}
The name of the aggregation (grades_stats above) also serves as the key by which the aggregation result can be retrieved from the returned response.
48.3.1. Standard Deviation Bounds
By default, the extended_stats metric will return an object called std_deviation_bounds, which provides an interval of plus/minus two standard
deviations from the mean. This can be a useful way to visualize variance of your data. If you want a different boundary, for example
three standard deviations, you can set sigma in the request:
{
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"field" : "grade",
"sigma" : 3
}
}
}
}
sigma controls how many standard deviations +/- from the mean should be displayed.
sigma can be any non-negative double, meaning you can request non-integer values such as 1.5. A value of 0 is valid, but will simply
return the average for both upper and lower bounds.
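The std_deviation_bounds in the earlier response can be reproduced from the other returned statistics; a Python check of the formulas upper/lower = avg +/- sigma * std_deviation, using the figures from that response:

```python
import math

# Figures taken from the extended_stats response shown above.
count, total, sum_of_squares = 9, 774, 67028

avg = total / count                            # 86.0
variance = sum_of_squares / count - avg ** 2   # 51.5555...
std_deviation = math.sqrt(variance)            # 7.1802...

sigma = 2  # the default
upper = avg + sigma * std_deviation
lower = avg - sigma * std_deviation
print(upper, lower)  # ~100.3604 and ~71.6396, matching std_deviation_bounds
```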
WARNING: Standard deviation and bounds require normality. The standard deviation and its bounds are displayed by default, but they are not always applicable to all data-sets. Your data must be normally distributed for the metrics to make sense. The statistics behind standard deviations assume normally distributed data, so if your data is skewed heavily left or right, the value returned will be misleading.
48.3.2. Script
Computing the grades stats based on a script:
{
...,
"aggs" : {
"grades_stats" : { "extended_stats" : { "script" : "doc['grade'].value" } }
}
}
This will interpret the script parameter as an inline script with the default script language and no script parameters. To use a file script use the following syntax:
{
...,
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"script" : {
"file": "my_script",
"params": {
"field": "grade"
}
}
}
}
}
}
NOTE: For indexed scripts, replace the file parameter with an id parameter.
Value Script
It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new stats:
{
"aggs" : {
...
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"field" : "grade",
"script" : {
"inline": "_value * correction",
"params" : {
"correction" : 1.2
}
}
}
}
}
}
}
48.3.3. Missing value
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
{
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"field" : "grade",
"missing": 0
}
}
}
}
Documents without a value in the grade field will fall into the same bucket as documents that have the value 0.
48.4. Geo Bounds Aggregation
A metric aggregation that computes the bounding box containing all geo_point values for a field.
Example:
{
"query" : {
"match" : { "business_type" : "shop" }
},
"aggs" : {
"viewport" : {
"geo_bounds" : {
"field" : "location",
"wrap_longitude" : true
}
}
}
}
The geo_bounds aggregation specifies the field to use to obtain the bounds.
wrap_longitude is an optional parameter which specifies whether the bounding box should be allowed to overlap the international date line; the default value is true.
The above aggregation demonstrates how one would compute the bounding box of the location field for all documents with a business type of shop
The response for the above aggregation:
{
...
"aggregations": {
"viewport": {
"bounds": {
"top_left": {
"lat": 80.45,
"lon": -160.22
},
"bottom_right": {
"lat": 40.65,
"lon": 42.57
}
}
}
}
}
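Conceptually the bounding box is a running min/max over the coordinates; a simplified Python sketch that ignores wrap_longitude and date-line handling (the sample points are hypothetical):

```python
# Hypothetical shop locations as (lat, lon) pairs.
points = [(80.45, -160.22), (40.65, 42.57), (52.37, 4.89)]

# top_left is the northernmost latitude and westernmost longitude;
# bottom_right is the southernmost latitude and easternmost longitude.
top_left = (max(lat for lat, _ in points), min(lon for _, lon in points))
bottom_right = (min(lat for lat, _ in points), max(lon for _, lon in points))

print(top_left)      # (80.45, -160.22)
print(bottom_right)  # (40.65, 42.57)
```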
48.5. Geo Centroid Aggregation
A metric aggregation that computes the weighted centroid from all coordinate values for a Geo-point datatype field.
Example:
{
"query" : {
"match" : { "crime" : "burglary" }
},
"aggs" : {
"centroid" : {
"geo_centroid" : {
"field" : "location"
}
}
}
}
The geo_centroid aggregation specifies the field to use for computing the centroid (NOTE: the field must be of the Geo-point datatype).
The above aggregation demonstrates how one would compute the centroid of the location field for all documents with a crime type of burglary
The response for the above aggregation:
{
...
"aggregations": {
"centroid": {
"location": {
"lat": 80.45,
"lon": -160.22
}
}
}
}
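Conceptually the centroid is the document-weighted mean of the coordinates; a naive flat-earth Python sketch (the points are hypothetical, and a real implementation must account for spherical geometry):

```python
# Hypothetical burglary locations as (lat, lon) pairs.
points = [(37.39, -122.12), (37.38, -122.08), (37.40, -122.10)]

# Arithmetic mean of each coordinate axis.
lat = sum(p[0] for p in points) / len(points)
lon = sum(p[1] for p in points) / len(points)
print(round(lat, 2), round(lon, 2))  # 37.39 -122.1
```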
The geo_centroid aggregation is more interesting when combined as a sub-aggregation to other bucket aggregations.
Example:
{
"query" : {
"match" : { "crime" : "burglary" }
},
"aggs" : {
"towns" : {
"terms" : { "field" : "town" },
"aggs" : {
"centroid" : {
"geo_centroid" : { "field" : "location" }
}
}
}
}
}
The above example uses geo_centroid as a sub-aggregation to a terms bucket aggregation
for finding the central location for all crimes of type burglary in each town.
The response for the above aggregation:
{
...
"buckets": [
{
"key": "Los Altos",
"doc_count": 113,
"centroid": {
"location": {
"lat": 37.3924582824111,
"lon": -122.12104808539152
}
}
},
{
"key": "Mountain View",
"doc_count": 92,
"centroid": {
"location": {
"lat": 37.382152481004596,
"lon": -122.08116559311748
}
}
}
]
}
48.6. Max Aggregation
A single-value metrics aggregation that keeps track and returns the maximum value among the numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
Computing the max price value across all documents
{
"aggs" : {
"max_price" : { "max" : { "field" : "price" } }
}
}
Response:
{
...
"aggregations": {
"max_price": {
"value": 35
}
}
}
As can be seen, the name of the aggregation (max_price above) also serves as the key by which the aggregation result can be retrieved from the returned response.
48.6.1. Script
Computing the max price value across all document, this time using a script:
{
"aggs" : {
"max_price" : { "max" : { "script" : "doc['price'].value" } }
}
}
This will interpret the script parameter as an inline script with the default script language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"max_price" : {
"max" : {
"script" : {
"file": "my_script",
"params": {
"field": "price"
}
}
}
}
}
}
NOTE: For indexed scripts, replace the file parameter with an id parameter.
48.6.2. Value Script
Let’s say that the prices of the documents in our index are in USD, but we would like to compute the max in EURO (and for the sake of this example, let's say the conversion rate is 1.2). We can use a value script to apply the conversion rate to every value before it is aggregated:
{
"aggs" : {
"max_price_in_euros" : {
"max" : {
"field" : "price",
"script" : {
"inline": "_value * conversion_rate",
"params" : {
"conversion_rate" : 1.2
}
}
}
}
}
}
48.6.3. Missing value
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
{
"aggs" : {
"grade_max" : {
"max" : {
"field" : "grade",
"missing": 10
}
}
}
}
Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.
48.7. Min Aggregation
A single-value metrics aggregation that keeps track and returns the minimum value among numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
Computing the min price value across all documents:
{
"aggs" : {
"min_price" : { "min" : { "field" : "price" } }
}
}
Response:
{
...
"aggregations": {
"min_price": {
"value": 10
}
}
}
As can be seen, the name of the aggregation (min_price above) also serves as the key by which the aggregation result can be retrieved from the returned response.
48.7.1. Script
Computing the min price value across all document, this time using a script:
{
"aggs" : {
"min_price" : { "min" : { "script" : "doc['price'].value" } }
}
}
This will interpret the script parameter as an inline script with the default script language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"min_price" : {
"min" : {
"script" : {
"file": "my_script",
"params": {
"field": "price"
}
}
}
}
}
}
NOTE: For indexed scripts, replace the file parameter with an id parameter.
48.7.2. Value Script
Let’s say that the prices of the documents in our index are in USD, but we would like to compute the min in EURO (and for the sake of this example, let's say the conversion rate is 1.2). We can use a value script to apply the conversion rate to every value before it is aggregated:
{
"aggs" : {
"min_price_in_euros" : {
"min" : {
"field" : "price",
"script" : {
"inline": "_value * conversion_rate",
"params" : {
"conversion_rate" : 1.2
}
}
}
}
}
}
48.7.3. Missing value
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
{
"aggs" : {
"grade_min" : {
"min" : {
"field" : "grade",
"missing": 10
}
}
}
}
Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.
48.8. Percentiles Aggregation
A multi-value metrics aggregation that calculates one or more percentiles
over numeric values extracted from the aggregated documents. These values
can be extracted either from specific numeric fields in the documents, or
be generated by a provided script.
Percentiles show the point at which a certain percentage of observed values occur. For example, the 95th percentile is the value which is greater than 95% of the observed values.
Percentiles are often used to find outliers. In normal distributions, the 0.13th and 99.87th percentiles represent three standard deviations from the mean. Any data which falls outside three standard deviations is often considered an anomaly.
When a range of percentiles are retrieved, they can be used to estimate the data distribution and determine if the data is skewed, bimodal, etc.
Assume your data consists of website load times. The average and median load times are not overly useful to an administrator. The max may be interesting, but it can be easily skewed by a single slow response.
Let’s look at a range of percentiles representing load time:
{
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time"
}
}
}
}
The field load_time must be a numeric field.
By default, the percentile metric will generate a range of
percentiles: [ 1, 5, 25, 50, 75, 95, 99 ]. The response will look like this:
{
...
"aggregations": {
"load_time_outlier": {
"values" : {
"1.0": 15,
"5.0": 20,
"25.0": 23,
"50.0": 25,
"75.0": 29,
"95.0": 60,
"99.0": 150
}
}
}
}
As you can see, the aggregation will return a calculated value for each percentile in the default range. If we assume response times are in milliseconds, it is immediately obvious that the webpage normally loads in 15-30ms, but occasionally spikes to 60-150ms.
Often, administrators are only interested in outliers — the extreme percentiles. We can specify just the percents we are interested in (requested percentiles must be a value between 0-100 inclusive):
{
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time",
"percents" : [95, 99, 99.9]
}
}
}
}
Use the percents parameter to specify particular percentiles to calculate.
48.8.1. Script
The percentile metric supports scripting. For example, if our load times are in milliseconds but we want percentiles calculated in seconds, we could use a script to convert them on-the-fly:
{
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"script" : {
"inline": "doc['load_time'].value / timeUnit",
"params" : {
"timeUnit" : 1000
}
}
}
}
}
}
The field parameter is replaced with a script parameter, which uses the
script to generate values which percentiles are calculated on.
NOTE: Scripting supports parameterized input just like any other script.
This will interpret the script parameter as an inline script with the default script language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"script" : {
"file": "my_script",
"params" : {
"timeUnit" : 1000
}
}
}
}
}
}
NOTE: For indexed scripts, replace the file parameter with an id parameter.
48.8.2. Percentiles are (usually) approximate
There are many different algorithms to calculate percentiles. The naive
implementation simply stores all the values in a sorted array. To find the 50th
percentile, you simply find the value that is at my_array[count(my_array) * 0.5].
Clearly, the naive implementation does not scale — the sorted array grows linearly with the number of values in your dataset. To calculate percentiles across potentially billions of values in an Elasticsearch cluster, approximate percentiles are calculated.
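The naive implementation described in the previous paragraph can be written out directly; a sketch of the non-scaling baseline (it truncates the computed index, while Elasticsearch's actual metric interpolates):

```python
def naive_percentile(values, q):
    """q in [0, 1]; stores every value sorted, so memory is O(n)."""
    my_array = sorted(values)
    index = int(len(my_array) * q)  # my_array[count(my_array) * q]
    return my_array[min(index, len(my_array) - 1)]

# Hypothetical load times in milliseconds.
load_times = [15, 20, 23, 25, 29, 60, 150]
print(naive_percentile(load_times, 0.5))  # 25
```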
The algorithm used by the percentile metric is called TDigest (introduced by
Ted Dunning in
Computing Accurate Quantiles using T-Digests).
When using this metric, there are a few guidelines to keep in mind:
-
Accuracy is proportional to
q(1-q). This means that extreme percentiles (e.g. 99%) are more accurate than less extreme percentiles, such as the median -
For small sets of values, percentiles are highly accurate (and potentially 100% accurate if the data is small enough).
-
As the quantity of values in a bucket grows, the algorithm begins to approximate the percentiles. It is effectively trading accuracy for memory savings. The exact level of inaccuracy is difficult to generalize, since it depends on your data distribution and volume of data being aggregated
The following chart shows the relative error on a uniform distribution depending on the number of collected values and the requested percentile:

It shows how precision is better for extreme percentiles. The reason why error diminishes for large number of values is that the law of large numbers makes the distribution of values more and more uniform and the t-digest tree can do a better job at summarizing it. It would not be the case on more skewed distributions.
48.8.3. Compression
experimental[The compression parameter is specific to the current internal implementation of percentiles, and may change in the future]
Approximate algorithms must balance memory utilization with estimation accuracy.
This balance can be controlled using a compression parameter:
{
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time",
"compression" : 200
}
}
}
}
Compression controls memory usage and approximation error.
The TDigest algorithm uses a number of "nodes" to approximate percentiles: the
more nodes available, the higher the accuracy (and the larger the memory footprint),
proportional to the volume of data. The compression parameter limits the maximum number of
nodes to 20 * compression.
Therefore, by increasing the compression value, you can increase the accuracy of
your percentiles at the cost of more memory. Larger compression values also
make the algorithm slower since the underlying tree data structure grows in size,
resulting in more expensive operations. The default compression value is
100.
A "node" uses roughly 32 bytes of memory, so under worst-case scenarios (large amount of data which arrives sorted and in-order) the default settings will produce a TDigest roughly 64KB in size. In practice data tends to be more random and the TDigest will use less memory.
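The 64KB worst case quoted above follows directly from the node budget; the arithmetic as a back-of-the-envelope sketch:

```python
def tdigest_worst_case_bytes(compression, bytes_per_node=32):
    max_nodes = 20 * compression      # documented cap on TDigest nodes
    return max_nodes * bytes_per_node

print(tdigest_worst_case_bytes(100))  # 64000 bytes, roughly 64KB at the default
print(tdigest_worst_case_bytes(200))  # doubling compression doubles the budget
```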
48.8.4. HDR Histogram
experimental[]
HDR Histogram (High Dynamic Range Histogram) is an alternative implementation that can be useful when calculating percentiles for latency measurements as it can be faster than the t-digest implementation with the trade-off of a larger memory footprint. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits). This means that if data is recorded with values from 1 microsecond up to 1 hour (3,600,000,000 microseconds) in a histogram set to 3 significant digits, it will maintain a value resolution of 1 microsecond for values up to 1 millisecond and 3.6 seconds (or better) for the maximum tracked value (1 hour).
The HDR Histogram can be used by specifying the method parameter in the request:
{
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time",
"percents" : [95, 99, 99.9],
"method" : "hdr",
"number_of_significant_value_digits" : 3
}
}
}
}
The method parameter is set to hdr to indicate that HDR Histogram should be used to calculate the percentiles.
number_of_significant_value_digits specifies the resolution of values for the histogram in number of significant digits.
The HDRHistogram only supports positive values and will error if it is passed a negative value. It is also not a good idea to use the HDRHistogram if the range of values is unknown as this could lead to high memory usage.
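The resolution guarantee scales with the magnitude of the value: with n significant digits, the worst-case resolution at value v is roughly v / 10^n. A sketch checking the figures from the paragraph above (this is the rule of thumb, not the library's exact bucketing):

```python
def hdr_resolution(value, significant_digits):
    # Approximate worst-case value resolution at this magnitude.
    return value / 10 ** significant_digits

# One hour in microseconds, tracked with 3 significant digits:
print(hdr_resolution(3_600_000_000, 3))  # 3600000.0 microseconds = 3.6 seconds
print(hdr_resolution(1_000, 3))          # 1.0 microsecond at the 1ms mark
```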
48.8.5. Missing value
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
{
"aggs" : {
"grade_percentiles" : {
"percentiles" : {
"field" : "grade",
"missing": 10
}
}
}
}
Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.
48.9. Percentile Ranks Aggregation
A multi-value metrics aggregation that calculates one or more percentile ranks
over numeric values extracted from the aggregated documents. These values
can be extracted either from specific numeric fields in the documents, or
be generated by a provided script.
NOTE: Please see Percentiles are (usually) approximate and Compression for advice regarding approximation and memory use of the percentile ranks aggregation.
Percentile ranks show the percentage of observed values which are below a certain value. For example, if a value is greater than or equal to 95% of the observed values, it is said to be at the 95th percentile rank.
Assume your data consists of website load times. You may have a service agreement that 95% of page loads complete within 15ms and 99% of page loads complete within 30ms.
Let’s look at a range of percentiles representing load time:
{
"aggs" : {
"load_time_outlier" : {
"percentile_ranks" : {
"field" : "load_time",
"values" : [15, 30]
}
}
}
}
The field load_time must be a numeric field.
The response will look like this:
{
...
"aggregations": {
"load_time_outlier": {
"values" : {
"15": 92,
"30": 100
}
}
}
}
From this information you can determine that you are hitting the 99% load time target but not quite hitting the 95% load time target.
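A percentile rank is just the share of observed values at or below the query value; a minimal Python sketch (the load times are hypothetical, chosen to echo the response above):

```python
def percentile_rank(values, v):
    """Percentage of observed values at or below v."""
    at_or_below = sum(1 for x in values if x <= v)
    return 100.0 * at_or_below / len(values)

# 25 hypothetical load times in ms: 23 of them <= 15, all of them <= 30.
load_times = [10] * 23 + [20, 28]
print(percentile_rank(load_times, 15))  # 92.0
print(percentile_rank(load_times, 30))  # 100.0
```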
48.9.1. Script
The percentile rank metric supports scripting. For example, if our load times are in milliseconds but we want to specify values in seconds, we could use a script to convert them on-the-fly:
{
"aggs" : {
"load_time_outlier" : {
"percentile_ranks" : {
"values" : [3, 5],
"script" : {
"inline": "doc['load_time'].value / timeUnit",
"params" : {
"timeUnit" : 1000
}
}
}
}
}
}
The field parameter is replaced with a script parameter, which uses the
script to generate values which percentile ranks are calculated on.
NOTE: Scripting supports parameterized input just like any other script.
This will interpret the script parameter as an inline script with the default script language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"load_time_outlier" : {
"percentile_ranks" : {
"values" : [3, 5],
"script" : {
"file": "my_script",
"params" : {
"timeUnit" : 1000
}
}
}
}
}
}
NOTE: For indexed scripts, replace the file parameter with an id parameter.
48.9.2. HDR Histogram
experimental[]
HDR Histogram (High Dynamic Range Histogram) is an alternative implementation that can be useful when calculating percentile ranks for latency measurements as it can be faster than the t-digest implementation with the trade-off of a larger memory footprint. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits). This means that if data is recorded with values from 1 microsecond up to 1 hour (3,600,000,000 microseconds) in a histogram set to 3 significant digits, it will maintain a value resolution of 1 microsecond for values up to 1 millisecond and 3.6 seconds (or better) for the maximum tracked value (1 hour).
The HDR Histogram can be used by specifying the method parameter in the request:
{
"aggs" : {
"load_time_outlier" : {
"percentile_ranks" : {
"field" : "load_time",
"values" : [15, 30],
"method" : "hdr",
"number_of_significant_value_digits" : 3
}
}
}
}
The method parameter is set to hdr to indicate that HDR Histogram should be used to calculate the percentile_ranks.
number_of_significant_value_digits specifies the resolution of values for the histogram in number of significant digits.
The HDRHistogram only supports positive values and will error if it is passed a negative value. It is also not a good idea to use the HDRHistogram if the range of values is unknown as this could lead to high memory usage.
48.9.3. Missing value
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
{
"aggs" : {
"grade_ranks" : {
"percentile_ranks" : {
"field" : "grade",
"missing": 10
}
}
}
}
Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.
48.10. Scripted Metric Aggregation
experimental[]
A metric aggregation that executes using scripts to provide a metric output.
Example:
{
"query" : {
"match_all" : {}
},
"aggs": {
"profit": {
"scripted_metric": {
"init_script" : "_agg['transactions'] = []",
"map_script" : "if (doc['type'].value == \"sale\") { _agg.transactions.add(doc['amount'].value) } else { _agg.transactions.add(-1 * doc['amount'].value) }",
"combine_script" : "profit = 0; for (t in _agg.transactions) { profit += t }; return profit",
"reduce_script" : "profit = 0; for (a in _aggs) { profit += a }; return profit"
}
}
}
}
map_script is the only required parameter.
The above aggregation demonstrates how one would use the script aggregation to compute the total profit from sale and cost transactions.
The response for the above aggregation:
{
...
"aggregations": {
"profit": {
"value": 170
}
}
}
The above example can also be specified using file scripts as follows:
{
"query" : {
"match_all" : {}
},
"aggs": {
"profit": {
"scripted_metric": {
"init_script" : {
"file": "my_init_script"
},
"map_script" : {
"file": "my_map_script"
},
"combine_script" : {
"file": "my_combine_script"
},
"params": {
"field": "amount"
},
"reduce_script" : {
"file": "my_reduce_script"
}
}
}
}
}
Script parameters for the init, map and combine scripts must be specified in a global params object so that they can be shared between the scripts.
For more details on specifying scripts see script documentation.
48.10.1. Allowed return types
Whilst any valid script object can be used within a single script, the scripts must return or store in the _agg object only the following types:
-
primitive types
-
String
-
Map (containing only keys and values of the types listed here)
-
Array (containing elements of only the types listed here)
48.10.2. Scope of scripts
The scripted metric aggregation uses scripts at 4 stages of its execution:
- init_script
-
Executed prior to any collection of documents. Allows the aggregation to set up any initial state.
In the above example, the init_script creates an array transactions in the _agg object.
- map_script
-
Executed once per document collected. This is the only required script. If no combine_script is specified, the resulting state needs to be stored in an object named _agg.
In the above example, the map_script checks the value of the type field. If the value is sale, the value of the amount field is added to the transactions array. If the value of the type field is not sale, the negated value of the amount field is added to transactions.
- combine_script
-
Executed once on each shard after document collection is complete. Allows the aggregation to consolidate the state returned from each shard. If a combine_script is not provided the combine phase will return the aggregation variable.
In the above example, the combine_script iterates through all the stored transactions, summing the values in the profit variable and finally returns profit.
- reduce_script
-
Executed once on the coordinating node after all shards have returned their results. The script is provided with access to a variable _aggs which is an array of the result of the combine_script on each shard. If a reduce_script is not provided the reduce phase will return the _aggs variable.
In the above example, the reduce_script iterates through the profit returned by each shard, summing the values before returning the final combined profit, which will be returned in the response of the aggregation.
48.10.3. Worked Example
Imagine a situation where you index the following documents into an index with 2 shards:
$ curl -XPUT 'http://localhost:9200/transactions/stock/1' -d '
{
"type": "sale",
"amount": 80
}
'
$ curl -XPUT 'http://localhost:9200/transactions/stock/2' -d '
{
"type": "cost",
"amount": 10
}
'
$ curl -XPUT 'http://localhost:9200/transactions/stock/3' -d '
{
"type": "cost",
"amount": 30
}
'
$ curl -XPUT 'http://localhost:9200/transactions/stock/4' -d '
{
"type": "sale",
"amount": 130
}
'
Let's say that documents 1 and 3 end up on shard A and documents 2 and 4 end up on shard B. The following is a breakdown of what the aggregation result is at each stage of the example above.
Before init_script
No params object was specified so the default params object is used:
"params" : {
"_agg" : {}
}
After init_script
This is run once on each shard before any document collection is performed, and so we will have a copy on each shard:
- Shard A
-
"params" : { "_agg" : { "transactions" : [] } } - Shard B
-
"params" : { "_agg" : { "transactions" : [] } }
After map_script
Each shard collects its documents and runs the map_script on each document that is collected:
- Shard A
-
"params" : { "_agg" : { "transactions" : [ 80, -30 ] } } - Shard B
-
"params" : { "_agg" : { "transactions" : [ -10, 130 ] } }
After combine_script
The combine_script is executed on each shard after document collection is complete and reduces all the transactions down to a single profit figure for each shard (by summing the values in the transactions array) which is passed back to the coordinating node:
- Shard A
-
50
- Shard B
-
120
After reduce_script
The reduce_script receives an _aggs array containing the result of the combine script for each shard:
"_aggs" : [
50,
120
]
It reduces the responses for the shards down to a final overall profit figure (by summing the values) and returns this as the result of the aggregation to produce the response:
{
...
"aggregations": {
"profit": {
"value": 170
}
}
}
48.10.4. Other Parameters
| params | Optional. An object whose contents will be passed as variables to the init_script, map_script and combine_script. |
| reduce_params | Optional. An object whose contents will be passed as variables to the reduce_script. |
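For illustration, custom variables can be supplied through params; a sketch (the field name amount mirrors the worked example, and note that when params is specified the _agg object still needs to be provided):

```json
{
    "aggs": {
        "profit": {
            "scripted_metric": {
                "params" : {
                    "field" : "amount",
                    "_agg" : {}
                },
                "init_script" : "_agg.transactions = []",
                "map_script" : "_agg.transactions.add(doc[field].value)",
                "combine_script" : "sum = 0; for (t in _agg.transactions) { sum += t }; return sum",
                "reduce_script" : "sum = 0; for (a in _aggs) { sum += a }; return sum"
            }
        }
    }
}
```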
48.11. Stats Aggregation
A multi-value metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
The stats that are returned consist of: min, max, sum, count and avg.
Assuming the data consists of documents representing exams grades (between 0 and 100) of students
{
"aggs" : {
"grades_stats" : { "stats" : { "field" : "grade" } }
}
}
The above aggregation computes the grades statistics over all documents. The aggregation type is stats and the field setting defines the numeric field of the documents the stats will be computed on. The above will return the following:
{
...
"aggregations": {
"grades_stats": {
"count": 6,
"min": 60,
"max": 98,
"avg": 78.5,
"sum": 471
}
}
}
The name of the aggregation (grades_stats above) also serves as the key by which the aggregation result can be retrieved from the returned response.
48.11.1. Script
Computing the grades stats based on a script:
{
...,
"aggs" : {
"grades_stats" : { "stats" : { "script" : "doc['grade'].value" } }
}
}
This will interpret the script parameter as an inline script with the default script language and no script parameters. To use a file script use the following syntax:
{
...,
"aggs" : {
"grades_stats" : {
"stats" : {
"script" : {
"file": "my_script",
"params" : {
"field" : "grade"
}
}
}
}
}
}
Note: for indexed scripts, replace the file parameter with an id parameter.
Value Script
It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new stats:
{
"aggs" : {
...
"aggs" : {
"grades_stats" : {
"stats" : {
"field" : "grade",
"script" :
"inline": "_value * correction",
"params" : {
"correction" : 1.2
}
}
}
}
}
}
}
48.11.2. Missing value
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
{
"aggs" : {
"grades_stats" : {
"stats" : {
"field" : "grade",
"missing": 0
}
}
}
}
Documents without a value in the grade field will fall into the same bucket as documents that have the value 0.
48.12. Sum Aggregation
A single-value metrics aggregation that sums up numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
Assuming the data consists of documents representing stock ticks, where each tick holds the change in the stock price from the previous tick.
{
"query" : {
"constant_score" : {
"filter" : {
"range" : { "timestamp" : { "from" : "now/1d+9.5h", "to" : "now/1d+16h" }}
}
}
},
"aggs" : {
"intraday_return" : { "sum" : { "field" : "change" } }
}
}
The above aggregation sums up all changes in the today’s trading stock ticks which accounts for the intraday return. The aggregation type is sum and the field setting defines the numeric field of the documents of which values will be summed up. The above will return the following:
{
...
"aggregations": {
"intraday_return": {
"value": 2.18
}
}
}
The name of the aggregation (intraday_return above) also serves as the key by which the aggregation result can be retrieved from the returned response.
48.12.1. Script
Computing the intraday return based on a script:
{
...,
"aggs" : {
"intraday_return" : { "sum" : { "script" : "doc['change'].value" } }
}
}
This will interpret the script parameter as an inline script with the default script language and no script parameters. To use a file script use the following syntax:
{
...,
"aggs" : {
"intraday_return" : {
"sum" : {
"script" : {
"file": "my_script",
"params" : {
"field" : "change"
}
}
}
}
}
}
Note: for indexed scripts, replace the file parameter with an id parameter.
48.12.2. Missing value
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
{
"aggs" : {
"total_time" : {
"sum" : {
"field" : "took",
"missing": 100
}
}
}
}
Documents without a value in the took field will fall into the same bucket as documents that have the value 100.
48.13. Top hits Aggregation
A top_hits metric aggregator keeps track of the most relevant document being aggregated. This aggregator is intended
to be used as a sub-aggregator, so that the top matching documents can be aggregated per bucket.
The top_hits aggregator can effectively be used to group result sets by certain fields via a bucket aggregator.
One or more bucket aggregators determine the properties by which the result set is sliced.
48.13.1. Options
- from
-
The offset from the first result you want to fetch.
- size
-
The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned.
- sort
-
How the top matching hits should be sorted. By default the hits are sorted by the score of the main query.
48.13.2. Supported per hit features
The top_hits aggregation returns regular search hits; because of this, many per-hit features can be supported.
48.13.3. Example
In the following example we group the questions by tag and per tag we show the last active question. For each question only the title field is being included in the source.
{
"aggs": {
"top-tags": {
"terms": {
"field": "tags",
"size": 3
},
"aggs": {
"top_tag_hits": {
"top_hits": {
"sort": [
{
"last_activity_date": {
"order": "desc"
}
}
],
"_source": {
"include": [
"title"
]
},
"size" : 1
}
}
}
}
}
}
Possible response snippet:
"aggregations": {
"top-tags": {
"buckets": [
{
"key": "windows-7",
"doc_count": 25365,
"top_tags_hits": {
"hits": {
"total": 25365,
"max_score": 1,
"hits": [
{
"_index": "stack",
"_type": "question",
"_id": "602679",
"_score": 1,
"_source": {
"title": "Windows port opening"
},
"sort": [
1370143231177
]
}
]
}
}
},
{
"key": "linux",
"doc_count": 18342,
"top_tags_hits": {
"hits": {
"total": 18342,
"max_score": 1,
"hits": [
{
"_index": "stack",
"_type": "question",
"_id": "602672",
"_score": 1,
"_source": {
"title": "Ubuntu RFID Screensaver lock-unlock"
},
"sort": [
1370143379747
]
}
]
}
}
},
{
"key": "windows",
"doc_count": 18119,
"top_tags_hits": {
"hits": {
"total": 18119,
"max_score": 1,
"hits": [
{
"_index": "stack",
"_type": "question",
"_id": "602678",
"_score": 1,
"_source": {
"title": "If I change my computers date / time, what could be affected?"
},
"sort": [
1370142868283
]
}
]
}
}
}
]
}
}
48.13.4. Field collapse example
Field collapsing or result grouping is a feature that logically groups a result set into groups and per group returns
top documents. The ordering of the groups is determined by the relevancy of the first document in a group. In
Elasticsearch this can be implemented via a bucket aggregator that wraps a top_hits aggregator as sub-aggregator.
In the example below we search across crawled webpages. For each webpage we store the body and the domain the webpage
belongs to. By defining a terms aggregator on the domain field we group the result set of webpages by domain. The
top_hits aggregator is then defined as sub-aggregator, so that the top matching hits are collected per bucket.
Also a max aggregator is defined, which is used by the terms aggregator's order feature to return the buckets by
relevancy order of the most relevant document in a bucket.
{
"query": {
"match": {
"body": "elections"
}
},
"aggs": {
"top-sites": {
"terms": {
"field": "domain",
"order": {
"top_hit": "desc"
}
},
"aggs": {
"top_tags_hits": {
"top_hits": {}
},
"top_hit" : {
"max": {
"script": "_score"
}
}
}
}
}
}
At the moment the max (or min) aggregator is needed to make sure the buckets from the terms aggregator are
ordered according to the score of the most relevant webpage per domain. The top_hits aggregator isn’t a metric aggregator
and therefore can’t be used in the order option of the terms aggregator.
48.13.5. top_hits support in a nested or reverse_nested aggregator
If the top_hits aggregator is wrapped in a nested or reverse_nested aggregator, then nested hits are returned.
Nested hits are, in a sense, hidden mini-documents that are part of a regular document for which a nested field type
has been configured in the mapping. The top_hits aggregator has the ability to un-hide these documents if it is wrapped in a nested
or reverse_nested aggregator. Read more about nested in the nested type mapping.
If the nested type has been configured, a single document is actually indexed as multiple Lucene documents and they share
the same id. In order to determine the identity of a nested hit, more is needed than just the id, which is why
nested hits also include their nested identity. The nested identity is kept under the _nested field in the search hit
and includes the array field and the offset in the array field the nested hit belongs to. The offset is zero based.
Top hits response snippet with a nested hit, which resides in the third slot of array field nested_field1 in document with id 1:
...
"hits": {
"total": 25365,
"max_score": 1,
"hits": [
{
"_index": "a",
"_type": "b",
"_id": "1",
"_score": 1,
"_nested" : {
"field" : "nested_field1",
"offset" : 2
},
"_source": ...
},
...
]
}
...
If _source is requested then just the part of the source of the nested object is returned, not the entire source of the document.
Also, stored fields on the nested inner object level are accessible via a top_hits aggregator residing in a nested or reverse_nested aggregator.
Only nested hits will have a _nested field in the hit, non nested (regular) hits will not have a _nested field.
The information in _nested can also be used to parse the original source somewhere else if _source isn’t enabled.
If there are multiple levels of nested object types defined in mappings then the _nested information can also be hierarchical
in order to express the identity of nested hits that are two layers deep or more.
In the example below a nested hit resides in the first slot of the field nested_grand_child_field, which in turn resides in
the second slot of the nested_child_field field:
...
"hits": {
"total": 2565,
"max_score": 1,
"hits": [
{
"_index": "a",
"_type": "b",
"_id": "1",
"_score": 1,
"_nested" : {
"field" : "nested_child_field",
"offset" : 1,
"_nested" : {
"field" : "nested_grand_child_field",
"offset" : 0
}
},
"_source": ...
},
...
]
}
...
48.14. Value Count Aggregation
A single-value metrics aggregation that counts the number of values that are extracted from the aggregated documents.
These values can be extracted either from specific fields in the documents, or be generated by a provided script. Typically,
this aggregator will be used in conjunction with other single-value aggregations. For example, when computing the avg
one might be interested in the number of values the average is computed over.
{
"aggs" : {
"grades_count" : { "value_count" : { "field" : "grade" } }
}
}
Response:
{
...
"aggregations": {
"grades_count": {
"value": 10
}
}
}
The name of the aggregation (grades_count above) also serves as the key by which the aggregation result can be
retrieved from the returned response.
48.14.1. Script
Counting the values generated by a script:
{
...,
"aggs" : {
"grades_count" : { "value_count" : { "script" : "doc['grade'].value" } }
}
}
This will interpret the script parameter as an inline script with the default script language and no script parameters. To use a file script use the following syntax:
{
...,
"aggs" : {
"grades_count" : {
"value_count" : {
"script" : {
"file": "my_script",
"params" : {
"field" : "grade"
}
}
}
}
}
}
Note: for indexed scripts, replace the file parameter with an id parameter.
49. Bucket Aggregations
Bucket aggregations don’t calculate metrics over fields like the metrics aggregations do, but instead, they create
buckets of documents. Each bucket is associated with a criterion (depending on the aggregation type) which determines
whether or not a document in the current context "falls" into it. In other words, the buckets effectively define document
sets. In addition to the buckets themselves, the bucket aggregations also compute and return the number of documents
that "fell in" to each bucket.
Bucket aggregations, as opposed to metrics aggregations, can hold sub-aggregations. These sub-aggregations will be
aggregated for the buckets created by their "parent" bucket aggregation.
There are different bucket aggregators, each with a different "bucketing" strategy. Some define a single bucket, some define a fixed number of multiple buckets, and others dynamically create the buckets during the aggregation process.
49.1. Children Aggregation
A special single bucket aggregation that enables aggregating from buckets on parent document types to buckets on child documents.
This aggregation relies on the _parent field in the mapping. This aggregation has a single option:
- type
-
The child type that the buckets in the parent space should be mapped to.
For example, let’s say we have an index of questions and answers. The answer type has the following _parent field in the mapping:
{
"answer" : {
"_parent" : {
"type" : "question"
}
}
}
The question typed documents contain a tags field and the answer typed documents contain an owner field. With the children
aggregation the tags buckets can be mapped to the owner buckets in a single request, even though the two fields exist in
two different kinds of documents.
An example of a question typed document:
{
"body": "<p>I have Windows 2003 server and i bought a new Windows 2008 server...",
"title": "Whats the best way to file transfer my site from server to a newer one?",
"tags": [
"windows-server-2003",
"windows-server-2008",
"file-transfer"
]
}
An example of an answer typed document:
{
"owner": {
"location": "Norfolk, United Kingdom",
"display_name": "Sam",
"id": 48
},
"body": "<p>Unfortunately your pretty much limited to FTP...",
"creation_date": "2009-05-04T13:45:37.030"
}
The following request can be built that connects the two together:
{
"aggs": {
"top-tags": {
"terms": {
"field": "tags",
"size": 10
},
"aggs": {
"to-answers": {
"children": {
"type" : "answer"
},
"aggs": {
"top-names": {
"terms": {
"field": "owner.display_name",
"size": 10
}
}
}
}
}
}
}
}
The type points to the type / mapping with the name answer.
The above example returns the top question tags and per tag the top answer owners.
Possible response:
{
"aggregations": {
"top-tags": {
"buckets": [
{
"key": "windows-server-2003",
"doc_count": 25365,
"to-answers": {
"doc_count": 36004,
"top-names": {
"buckets": [
{
"key": "Sam",
"doc_count": 274
},
{
"key": "chris",
"doc_count": 19
},
{
"key": "david",
"doc_count": 14
},
...
]
}
}
},
{
"key": "linux",
"doc_count": 18342,
"to-answers": {
"doc_count": 6655,
"top-names": {
"buckets": [
{
"key": "abrams",
"doc_count": 25
},
{
"key": "ignacio",
"doc_count": 25
},
{
"key": "vazquez",
"doc_count": 25
},
...
]
}
}
},
{
"key": "windows",
"doc_count": 18119,
"to-answers": {
"doc_count": 24051,
"top-names": {
"buckets": [
{
"key": "molly7244",
"doc_count": 265
},
{
"key": "david",
"doc_count": 27
},
{
"key": "chris",
"doc_count": 26
},
...
]
}
}
},
{
"key": "osx",
"doc_count": 10971,
"to-answers": {
"doc_count": 5902,
"top-names": {
"buckets": [
{
"key": "diago",
"doc_count": 4
},
{
"key": "albert",
"doc_count": 3
},
{
"key": "asmus",
"doc_count": 3
},
...
]
}
}
},
{
"key": "ubuntu",
"doc_count": 8743,
"to-answers": {
"doc_count": 8784,
"top-names": {
"buckets": [
{
"key": "ignacio",
"doc_count": 9
},
{
"key": "abrams",
"doc_count": 8
},
{
"key": "molly7244",
"doc_count": 8
},
...
]
}
}
},
{
"key": "windows-xp",
"doc_count": 7517,
"to-answers": {
"doc_count": 13610,
"top-names": {
"buckets": [
{
"key": "molly7244",
"doc_count": 232
},
{
"key": "chris",
"doc_count": 9
},
{
"key": "john",
"doc_count": 9
},
...
]
}
}
},
{
"key": "networking",
"doc_count": 6739,
"to-answers": {
"doc_count": 2076,
"top-names": {
"buckets": [
{
"key": "molly7244",
"doc_count": 6
},
{
"key": "alnitak",
"doc_count": 5
},
{
"key": "chris",
"doc_count": 3
},
...
]
}
}
},
{
"key": "mac",
"doc_count": 5590,
"to-answers": {
"doc_count": 999,
"top-names": {
"buckets": [
{
"key": "abrams",
"doc_count": 2
},
{
"key": "ignacio",
"doc_count": 2
},
{
"key": "vazquez",
"doc_count": 2
},
...
]
}
}
},
{
"key": "wireless-networking",
"doc_count": 4409,
"to-answers": {
"doc_count": 6497,
"top-names": {
"buckets": [
{
"key": "molly7244",
"doc_count": 61
},
{
"key": "chris",
"doc_count": 5
},
{
"key": "mike",
"doc_count": 5
},
...
]
}
}
},
{
"key": "windows-8",
"doc_count": 3601,
"to-answers": {
"doc_count": 4263,
"top-names": {
"buckets": [
{
"key": "molly7244",
"doc_count": 3
},
{
"key": "msft",
"doc_count": 2
},
{
"key": "user172132",
"doc_count": 2
},
...
]
}
}
}
]
}
}
}
The number of question documents with the tag windows-server-2003.
The number of answer documents that are related to question documents with the tag windows-server-2003.
49.2. Date Histogram Aggregation
A multi-bucket aggregation similar to the histogram except it can
only be applied on date values. Since dates are represented internally in Elasticsearch as long values, it is possible
to use the normal histogram on dates as well, though accuracy will be compromised. The reason for this is that
time-based intervals are not fixed (think of leap years and the number of days in a month). For this reason,
we need special support for time-based data. From a functionality perspective, this histogram supports the same features
as the normal histogram. The main difference is that the interval can be specified by date/time expressions.
Requesting bucket intervals of a month.
{
"aggs" : {
"articles_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
}
}
}
}
Available expressions for interval: year, quarter, month, week, day, hour, minute, second
Fractional values are allowed for seconds, minutes, hours, days and weeks. For example 1.5 hours:
{
"aggs" : {
"articles_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "1.5h"
}
}
}
}
See Time units for accepted abbreviations.
49.2.1. Keys
Internally, a date is represented as a 64 bit number representing a timestamp
in milliseconds-since-the-epoch. These timestamps are returned as the bucket
keys. The key_as_string is the same timestamp converted to a formatted
date string using the format specified with the format parameter:
Note: if no format is specified, it will use the first date
format specified in the field mapping.
{
"aggs" : {
"articles_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "1M",
"format" : "yyyy-MM-dd"
}
}
}
}
The format parameter supports an expressive date format pattern.
Response:
{
"aggregations": {
"articles_over_time": {
"buckets": [
{
"key_as_string": "2013-02-02",
"key": 1328140800000,
"doc_count": 1
},
{
"key_as_string": "2013-03-02",
"key": 1330646400000,
"doc_count": 2
},
...
]
}
}
}
49.2.2. Time Zone
Date-times are stored in Elasticsearch in UTC. By default, all bucketing and
rounding is also done in UTC. The time_zone parameter can be used to indicate
that bucketing should use a different time zone.
Time zones may either be specified as an ISO 8601 UTC offset (e.g. +01:00 or
-08:00) or as a timezone id, an identifier used in the TZ database like
America/Los_Angeles.
Consider the following example:
PUT my_index/log/1
{
"date": "2015-10-01T00:30:00Z"
}
PUT my_index/log/2
{
"date": "2015-10-01T01:30:00Z"
}
GET my_index/_search?size=0
{
"aggs": {
"by_day": {
"date_histogram": {
"field": "date",
"interval": "day"
}
}
}
}
UTC is used if no time zone is specified, which would result in both of these documents being placed into the same day bucket, which starts at midnight UTC on 1 October 2015:
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2015-10-01T00:00:00.000Z",
"key": 1443657600000,
"doc_count": 2
}
]
}
}
If a time_zone of -01:00 is specified, then midnight starts at one hour before
midnight UTC:
GET my_index/_search?size=0
{
"aggs": {
"by_day": {
"date_histogram": {
"field": "date",
"interval": "day",
"time_zone": "-01:00"
}
}
}
}
Now the first document falls into the bucket for 30 September 2015, while the second document falls into the bucket for 1 October 2015:
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2015-09-30T00:00:00.000-01:00",
"key": 1443571200000,
"doc_count": 1
},
{
"key_as_string": "2015-10-01T00:00:00.000-01:00",
"key": 1443657600000,
"doc_count": 1
}
]
}
}
The key_as_string value represents midnight on each day
in the specified time zone. |
49.2.3. Offset
The offset parameter is used to change the start value of each bucket by the
specified positive (+) or negative offset (-) duration, such as 1h for
an hour, or 1M for a month. See Time units for more possible time
duration options.
For instance, when using an interval of day, each bucket runs from midnight
to midnight. Setting the offset parameter to +6h would change each bucket
to run from 6am to 6am:
PUT my_index/log/1
{
"date": "2015-10-01T05:30:00Z"
}
PUT my_index/log/2
{
"date": "2015-10-01T06:30:00Z"
}
GET my_index/_search?size=0
{
"aggs": {
"by_day": {
"date_histogram": {
"field": "date",
"interval": "day",
"offset": "+6h"
}
}
}
}
Instead of a single bucket starting at midnight, the above request groups the documents into buckets starting at 6am:
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2015-09-30T06:00:00.000Z",
"key": 1443592800000,
"doc_count": 1
},
{
"key_as_string": "2015-10-01T06:00:00.000Z",
"key": 1443679200000,
"doc_count": 1
}
]
}
}
Note: the start offset of each bucket is calculated after the time_zone
adjustments have been made.
49.2.4. Scripts
Like with the normal histogram, both document level scripts and
value level scripts are supported. It is also possible to control the order of the returned buckets using the order
settings and filter the returned buckets based on a min_doc_count setting (by default all buckets between the first
bucket that matches documents and the last one are returned). This histogram also supports the extended_bounds
setting, which enables extending the bounds of the histogram beyond the data itself (to read more on why you’d want to
do that please refer to the explanation here).
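As a sketch of how these settings combine, the request below uses min_doc_count to include empty buckets and extended_bounds to widen the histogram range (the field name and bound dates are placeholders for illustration):

```json
{
    "aggs" : {
        "articles_over_time" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month",
                "min_doc_count" : 0,
                "extended_bounds" : {
                    "min" : "2015-01-01",
                    "max" : "2015-12-31"
                }
            }
        }
    }
}
```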
49.2.5. Missing value
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
{
"aggs" : {
"publish_date" : {
"date_histogram" : {
"field" : "publish_date",
"interval": "year",
"missing": "2000-01-01"
}
}
}
}
Documents without a value in the publish_date field will fall into the same bucket as documents that have the value 2000-01-01.
49.3. Date Range Aggregation
A range aggregation that is dedicated for date values. The main difference between this aggregation and the normal range aggregation is that the from and to values can be expressed in Date Math expressions, and it is also possible to specify a date format by which the from and to response fields will be returned.
Note that this aggregation includes the from value and excludes the to value for each range.
Example:
{
"aggs": {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyy",
"ranges": [
{ "to": "now-10M/M" },
{ "from": "now-10M/M" }
]
}
}
}
}
< now minus 10 months, rounded down to the start of the month.
>= now minus 10 months, rounded down to the start of the month.
In the example above, we created two range buckets: the first will "bucket" all documents dated prior to 10 months ago, and the second will "bucket" all documents dated since 10 months ago.
Response:
{
...
"aggregations": {
"range": {
"buckets": [
{
"to": 1.3437792E+12,
"to_as_string": "08-2012",
"doc_count": 7
},
{
"from": 1.3437792E+12,
"from_as_string": "08-2012",
"doc_count": 2
}
]
}
}
}
49.3.1. Date Format/Pattern
Note: this information was copied from JodaDate.
All ASCII letters are reserved as format pattern letters, which are defined as follows:
| Symbol | Meaning | Presentation | Examples |
|---|---|---|---|
| G | era | text | AD |
| C | century of era (>=0) | number | 20 |
| Y | year of era (>=0) | year | 1996 |
| x | weekyear | year | 1996 |
| w | week of weekyear | number | 27 |
| e | day of week | number | 2 |
| E | day of week | text | Tuesday; Tue |
| y | year | year | 1996 |
| D | day of year | number | 189 |
| M | month of year | month | July; Jul; 07 |
| d | day of month | number | 10 |
| a | halfday of day | text | PM |
| K | hour of halfday (0~11) | number | 0 |
| h | clockhour of halfday (1~12) | number | 12 |
| H | hour of day (0~23) | number | 0 |
| k | clockhour of day (1~24) | number | 24 |
| m | minute of hour | number | 30 |
| s | second of minute | number | 55 |
| S | fraction of second | number | 978 |
| z | time zone | text | Pacific Standard Time; PST |
| Z | time zone offset/id | zone | -0800; -08:00; America/Los_Angeles |
| ' | escape for text | delimiter | '' |
The count of pattern letters determines the format.
- Text
-
If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used if available.
- Number
-
The minimum number of digits. Shorter numbers are zero-padded to this amount.
- Year
-
Numeric presentation for year and weekyear fields are handled specially. For example, if the count of y is 2, the year will be displayed as the zero-based year of the century, which is two digits.
- Month
-
3 or over, use text, otherwise use number.
- Zone
-
Z outputs offset without a colon, ZZ outputs the offset with a colon, ZZZ or more outputs the zone id.
- Zone names
-
Time zone names (z) cannot be parsed.
Any characters in the pattern that are not in the ranges of [a..z] and [A..Z] will be treated as quoted text. For instance, characters like :, ., ' ' (space), # and ? will appear in the resulting time text even if they are not embraced within single quotes.
49.4. Filter Aggregation
Defines a single bucket of all the documents in the current document set context that match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents.
Example:
{
"aggs" : {
"red_products" : {
"filter" : { "term": { "color": "red" } },
"aggs" : {
"avg_price" : { "avg" : { "field" : "price" } }
}
}
}
}
In the above example, we calculate the average price of all the products that are red.
Response:
{
...
"aggs" : {
"red_products" : {
"doc_count" : 100,
"avg_price" : { "value" : 56.3 }
}
}
}
49.5. Filters Aggregation
Defines a multi-bucket aggregation where each bucket is associated with a filter. Each bucket will collect all documents that match its associated filter.
Example:
{
"aggs" : {
"messages" : {
"filters" : {
"filters" : {
"errors" : { "term" : { "body" : "error" }},
"warnings" : { "term" : { "body" : "warning" }}
}
},
"aggs" : {
"monthly" : {
"histogram" : {
"field" : "timestamp",
"interval" : "1M"
}
}
}
}
}
}
In the above example, we analyze log messages. The aggregation will build two collections (buckets) of log messages: one for all those containing an error, and another for all those containing a warning. For each of these buckets it will break the messages down by month.
Response:
...
"aggs" : {
"messages" : {
"buckets" : {
"errors" : {
"doc_count" : 34,
"monthly" : {
"buckets" : [
... // the histogram monthly breakdown
]
}
},
"warnings" : {
"doc_count" : 439,
"monthly" : {
"buckets" : [
... // the histogram monthly breakdown
]
}
}
}
}
}
...
49.5.1. Anonymous filters
The filters field can also be provided as an array of filters, as in the following request:
{
"aggs" : {
"messages" : {
"filters" : {
"filters" : [
{ "term" : { "body" : "error" }},
{ "term" : { "body" : "warning" }}
]
},
"aggs" : {
"monthly" : {
"histogram" : {
"field" : "timestamp",
"interval" : "1M"
}
}
}
}
}
}
The filtered buckets are returned in the same order as provided in the request. The response for this example would be:
...
"aggs" : {
"messages" : {
"buckets" : [
{
"doc_count" : 34,
"monthly" : {
"buckets : [
... // the histogram monthly breakdown
]
}
},
{
"doc_count" : 439,
"monthly" : {
"buckets : [
... // the histogram monthly breakdown
]
}
}
]
}
}
...
49.5.2. Other Bucket
The other_bucket parameter can be set to add a bucket to the response which will contain all documents that do
not match any of the given filters. The value of this parameter can be as follows:
- false
-
Does not compute the other bucket.
- true
-
Returns the other bucket either in a bucket (named _other_ by default) if named filters are being used, or as the last bucket if anonymous filters are being used.
The other_bucket_key parameter can be used to set the key for the other bucket to a value other than the default _other_. Setting
this parameter will implicitly set the other_bucket parameter to true.
The following snippet shows a request where the other bucket is requested to be named other_messages.
{
"aggs" : {
"messages" : {
"filters" : {
"other_bucket_key": "other_messages",
"filters" : {
"errors" : { "term" : { "body" : "error" }},
"warnings" : { "term" : { "body" : "warning" }}
}
},
"aggs" : {
"monthly" : {
"histogram" : {
"field" : "timestamp",
"interval" : "1M"
}
}
}
}
}
}
The response would be something like the following:
...
"aggs" : {
"messages" : {
"buckets" : {
"errors" : {
"doc_count" : 34,
"monthly" : {
"buckets" : [
... // the histogram monthly breakdown
]
}
},
"warnings" : {
"doc_count" : 439,
"monthly" : {
"buckets" : [
... // the histogram monthly breakdown
]
}
},
"other_messages" : {
"doc_count" : 237,
"monthly" : {
"buckets" : [
... // the histogram monthly breakdown
]
}
}
}
}
}
}
...
49.6. Geo Distance Aggregation
A multi-bucket aggregation that works on geo_point fields and conceptually works very similarly to the range aggregation. The user can define a point of origin and a set of distance range buckets. The aggregation evaluates the distance of each document value from the origin point and determines the bucket it belongs to based on the ranges (a document belongs to a bucket if the distance between the document and the origin falls within the distance range of the bucket).
{
"aggs" : {
"rings_around_amsterdam" : {
"geo_distance" : {
"field" : "location",
"origin" : "52.3760, 4.894",
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 300 },
{ "from" : 300 }
]
}
}
}
}
Response:
{
"aggregations": {
"rings" : {
"buckets": [
{
"key": "*-100.0",
"from": 0,
"to": 100.0,
"doc_count": 3
},
{
"key": "100.0-300.0",
"from": 100.0,
"to": 300.0,
"doc_count": 1
},
{
"key": "300.0-*",
"from": 300.0,
"doc_count": 7
}
]
}
}
}
The specified field must be of type geo_point (which can only be set explicitly in the mappings). It can also hold an array of geo_point fields, in which case all will be taken into account during aggregation. The origin point can accept all formats supported by the geo_point type:
-
Object format: { "lat" : 52.3760, "lon" : 4.894 } - this is the safest format as it is the most explicit about the lat & lon values
-
String format: "52.3760, 4.894" - where the first number is the lat and the second is the lon
-
Array format: [4.894, 52.3760] - which is based on the GeoJSON standard and where the first number is the lon and the second one is the lat
By default, the distance unit is m (metres) but it can also accept: mi (miles), in (inches), yd (yards), km (kilometers), cm (centimeters), mm (millimeters).
{
"aggs" : {
"rings" : {
"geo_distance" : {
"field" : "location",
"origin" : "52.3760, 4.894",
"unit" : "mi",
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 300 },
{ "from" : 300 }
]
}
}
}
}
The distances will be computed in miles.
There are three distance calculation modes: sloppy_arc (the default), arc (most accurate) and plane (fastest). The arc calculation is the most accurate one but also the most expensive one in terms of performance. The sloppy_arc is faster but less accurate. The plane is the fastest but least accurate distance function. Consider using plane when your search context is "narrow" and spans smaller geographical areas (like cities or even countries). plane may return higher error margins for searches across very large areas (e.g. cross continent search). The distance calculation type can be set using the distance_type parameter:
{
"aggs" : {
"rings" : {
"geo_distance" : {
"field" : "location",
"origin" : "52.3760, 4.894",
"distance_type" : "plane",
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 300 },
{ "from" : 300 }
]
}
}
}
}
49.7. GeoHash grid Aggregation
A multi-bucket aggregation that works on geo_point fields and groups points into buckets that represent cells in a grid.
The resulting grid can be sparse and only contains cells that have matching data. Each cell is labeled using a geohash which is of user-definable precision.
-
High precision geohashes have a long string length and represent cells that cover only a small area.
-
Low precision geohashes have a short string length and represent cells that each cover a large area.
Geohashes used in this aggregation can have a choice of precision between 1 and 12.
The highest-precision geohash of length 12 produces cells that cover less than a square metre of land and so high-precision requests can be very costly in terms of RAM and result sizes. Please see the example below on how to first filter the aggregation to a smaller geographic area before requesting high levels of detail.
The specified field must be of type geo_point (which can only be set explicitly in the mappings) and it can also hold an array of geo_point fields, in which case all points will be taken into account during aggregation.
49.7.1. Simple low-precision request
{
"aggregations" : {
"myLarge-GrainGeoHashGrid" : {
"geohash_grid" : {
"field" : "location",
"precision" : 3
}
}
}
}
Response:
{
"aggregations": {
"myLarge-GrainGeoHashGrid": {
"buckets": [
{
"key": "svz",
"doc_count": 10964
},
{
"key": "sv8",
"doc_count": 3198
}
]
}
}
}
49.7.2. High-precision requests
When requesting detailed buckets (typically for displaying a "zoomed in" map) a filter like geo_bounding_box should be applied to narrow the subject area otherwise potentially millions of buckets will be created and returned.
{
"aggregations" : {
"zoomedInView" : {
"filter" : {
"geo_bounding_box" : {
"location" : {
"top_left" : "51.73, 0.9",
"bottom_right" : "51.55, 1.1"
}
}
},
"aggregations":{
"zoom1":{
"geohash_grid" : {
"field":"location",
"precision":8,
}
}
}
}
}
}
49.7.3. Cell dimensions at the equator
The table below shows the metric dimensions for cells covered by various string lengths of geohash. Cell dimensions vary with latitude and so the table is for the worst-case scenario at the equator.
| GeoHash length | Area width x height |
| 1 | 5,009.4km x 4,992.6km |
| 2 | 1,252.3km x 624.1km |
| 3 | 156.5km x 156km |
| 4 | 39.1km x 19.5km |
| 5 | 4.9km x 4.9km |
| 6 | 1.2km x 609.4m |
| 7 | 152.9m x 152.4m |
| 8 | 38.2m x 19m |
| 9 | 4.8m x 4.8m |
| 10 | 1.2m x 59.5cm |
| 11 | 14.9cm x 14.9cm |
| 12 | 3.7cm x 1.9cm |
49.7.4. Options
| field | Mandatory. The name of the field indexed with GeoPoints. |
| precision | Optional. The string length of the geohashes used to define cells/buckets in the results. Defaults to 5. |
| size | Optional. The maximum number of geohash buckets to return (defaults to 10,000). When results are trimmed, buckets are prioritised based on the volumes of documents they contain. A value of |
| shard_size | Optional. To allow for more accurate counting of the top cells returned in the final result the aggregation defaults to returning |
49.8. Global Aggregation
Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you’re searching on, but is not influenced by the search query itself.
Global aggregators can only be placed as top level aggregators (it makes no sense to embed a global aggregator within another bucket aggregator).
Example:
{
"query" : {
"match" : { "title" : "shirt" }
},
"aggs" : {
"all_products" : {
"global" : {},
"aggs" : {
"avg_price" : { "avg" : { "field" : "price" } }
}
}
}
}
The global aggregation has an empty body, and the sub-aggregations (here, avg_price) are registered under it.
The above aggregation demonstrates how one would compute aggregations (avg_price in this example) on all the documents in the search context, regardless of the query (in our example, it will compute the average price over all products in our catalog, not just on the "shirts").
The response for the above aggregation:
{
...
"aggregations" : {
"all_products" : {
"doc_count" : 100,
"avg_price" : {
"value" : 56.3
}
}
}
}
The doc_count is the number of documents that were aggregated (in our case, all documents within the search context).
49.9. Histogram Aggregation
A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents.
It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field
that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval 5
(in case of price it may represent $5). When the aggregation executes, the price field of every document will be
evaluated and will be rounded down to its closest bucket - for example, if the price is 32 and the bucket size is 5
then the rounding will yield 30 and thus the document will "fall" into the bucket that is associated with the key 30.
To make this more formal, here is the rounding function that is used:
rem = value % interval
if (rem < 0) {
rem += interval
}
bucket_key = value - rem
From the rounding function above it can be seen that the intervals themselves must be integers.
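The rounding can be reproduced in a couple of lines of Python, whose % operator already returns a non-negative remainder for a positive divisor, so the rem < 0 adjustment is built in (a sketch for illustration, not Elasticsearch's code):

```python
def bucket_key(value, interval):
    # Python's % takes the sign of the divisor, so for a positive
    # interval the remainder is already non-negative.
    rem = value % interval
    return value - rem

# a price of 32 with interval 5 rounds down to the bucket keyed 30
assert bucket_key(32, 5) == 30
# negative values round toward the lower bucket boundary, not toward zero
assert bucket_key(-6, 5) == -10
```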
Currently, values are cast to integers before being bucketed, which might cause negative floating-point values to fall into the wrong bucket. For instance, -4.5 with an interval of 2 would be cast to -4, and so would end up in the -4 <= val < -2 bucket instead of the -6 <= val < -4 bucket.
The following snippet "buckets" the products based on their price by interval of 50:
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50
}
}
}
}
And the following may be the response:
{
"aggregations": {
"prices" : {
"buckets": [
{
"key": 0,
"doc_count": 2
},
{
"key": 50,
"doc_count": 4
},
{
"key": 100,
"doc_count": 0
},
{
"key": 150,
"doc_count": 3
}
]
}
}
}
49.9.1. Minimum document count
The response above shows that no documents have a price that falls within the range of [100 - 150). By default the response will fill gaps in the histogram with empty buckets. It is possible to change that and request only buckets with a higher minimum count thanks to the min_doc_count setting:
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"min_doc_count" : 1
}
}
}
}
Response:
{
"aggregations": {
"prices" : {
"buckets": [
{
"key": 0,
"doc_count": 2
},
{
"key": 50,
"doc_count": 4
},
{
"key": 150,
"doc_count": 3
}
]
}
}
}
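The gap-filling and min_doc_count behaviour can be mimicked with a short sketch (plain Python, using prices consistent with the responses above; not Elasticsearch's implementation):

```python
from collections import Counter

def histogram(values, interval, min_doc_count=0):
    """Bucket values by interval, filling gaps between the observed
    min and max keys, then drop buckets below min_doc_count."""
    counts = Counter((v // interval) * interval for v in values)
    lo, hi = min(counts), max(counts)
    return [
        {"key": k, "doc_count": counts.get(k, 0)}
        for k in range(lo, hi + interval, interval)
        if counts.get(k, 0) >= min_doc_count
    ]

prices = [20, 47, 60, 70, 80, 98, 160, 170, 180]

# default behaviour: the empty [100, 150) bucket is returned as a gap-filler
assert {"key": 100, "doc_count": 0} in histogram(prices, 50)

# min_doc_count = 1 drops the empty bucket
assert all(b["doc_count"] >= 1 for b in histogram(prices, 50, min_doc_count=1))
```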
By default the histogram returns all the buckets within the range of the data itself, that is, the documents with the smallest values (on which the histogram runs) will determine the min bucket (the bucket with the smallest key) and the documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when requesting empty buckets, this causes confusion, specifically when the data is also filtered.
To understand why, let’s look at an example:
Let's say you're filtering your request to get all docs with values between 0 and 500, and in addition you'd like
to slice the data per price using a histogram with an interval of 50. You also specify "min_doc_count" : 0 as you'd
like to get all buckets, even the empty ones. If it happens that all products (documents) have prices higher than 100,
the first bucket you'll get will be the one with 100 as its key. This is confusing, as many times you'd also like
to get the buckets between 0 - 100.
With the extended_bounds setting, you can now "force" the histogram aggregation to start building buckets at a specific
min value and also keep on building buckets up to a max value (even if there are no documents anymore). Using
extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if min_doc_count
is greater than 0).
Note that (as the name suggests) extended_bounds is not filtering buckets. Meaning, if the extended_bounds.min is higher
than the values extracted from the documents, the documents will still dictate what the first bucket will be (and the
same goes for the extended_bounds.max and the last bucket). For filtering buckets, one should nest the histogram aggregation
under a range filter aggregation with the appropriate from/to settings.
Example:
{
"query" : {
"constant_score" : { "filter": { "range" : { "price" : { "to" : "500" } } } }
},
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"extended_bounds" : {
"min" : 0,
"max" : 500
}
}
}
}
}
49.9.2. Order
By default the returned buckets are sorted by their key ascending, though the order behaviour can be controlled
using the order setting.
Ordering the buckets by their key - descending:
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"order" : { "_key" : "desc" }
}
}
}
}
Ordering the buckets by their doc_count - ascending:
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"order" : { "_count" : "asc" }
}
}
}
}
If the histogram aggregation has a direct metrics sub-aggregation, the latter can determine the order of the buckets:
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"order" : { "price_stats.min" : "asc" }
},
"aggs" : {
"price_stats" : { "stats" : {} }
}
}
}
}
The { "price_stats.min" : "asc" } will sort the buckets based on the min value of their price_stats sub-aggregation.
There is no need to configure the price field for the price_stats aggregation as it will inherit it by default from its parent histogram aggregation.
It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long
as the aggregation path is made up of single-bucket aggregations, where the last aggregation in the path may either be a single-bucket
one or a metrics one. If it's a single-bucket type, the order will be defined by the number of docs in the bucket (i.e. doc_count);
in case it's a metrics one, the same rules as above apply (the path must indicate the metric name to sort by in case of
a multi-value metrics aggregation, and in case of a single-value metrics aggregation the sort will be applied on that value).
The path must be defined in the following form:
AGG_SEPARATOR := '>'
METRIC_SEPARATOR := '.'
AGG_NAME := <the name of the aggregation>
METRIC := <the name of the metric (in case of multi-value metrics aggregation)>
PATH := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>]
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"order" : { "promoted_products>rating_stats.avg" : "desc" }
},
"aggs" : {
"promoted_products" : {
"filter" : { "term" : { "promoted" : true }},
"aggs" : {
"rating_stats" : { "stats" : { "field" : "rating" }}
}
}
}
}
}
}
The above will sort the buckets based on the avg rating among the promoted products.
49.9.3. Offset
By default the bucket keys start with 0 and then continue in evenly spaced steps of interval, e.g. if the interval is 10 the first buckets
(assuming there is data inside them) will be [0 - 9], [10 - 19], [20 - 29]. The bucket boundaries can be shifted by using the offset option.
This can be best illustrated with an example. If there are 10 documents with values ranging from 5 to 14, using interval 10 will result in
two buckets with 5 documents each. If an additional offset 5 is used, there will be only one single bucket [5-14] containing all the 10
documents.
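One way to write the shifted rounding is to subtract the offset before rounding down and add it back afterwards (an assumed formulation for illustration, not the engine's code):

```python
def bucket_key(value, interval, offset=0):
    # shift by the offset, round down to the interval, shift back
    return ((value - offset) // interval) * interval + offset

docs = list(range(5, 15))  # ten documents with values 5..14

# without an offset the documents split across the [0-9] and [10-19] buckets
assert {bucket_key(v, 10) for v in docs} == {0, 10}

# with offset 5, all ten documents land in the single [5-14] bucket
assert {bucket_key(v, 10, offset=5) for v in docs} == {5}
```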
49.9.4. Response Format
By default, the buckets are returned as an ordered array. It is also possible to request the response as a hash keyed by the bucket keys instead:
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"keyed" : true
}
}
}
}
Response:
{
"aggregations": {
"prices": {
"buckets": {
"0": {
"key": 0,
"doc_count": 2
},
"50": {
"key": 50,
"doc_count": 4
},
"150": {
"key": 150,
"doc_count": 3
}
}
}
}
}
49.9.5. Missing value
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
{
"aggs" : {
"quantity" : {
"histogram" : {
"field" : "quantity",
"interval": 10,
"missing": 0
}
}
}
}
Documents without a value in the quantity field will fall into the same bucket as documents that have the value 0.
49.10. IPv4 Range Aggregation
Just like the dedicated date range aggregation, there is also a dedicated range aggregation for IPv4 typed fields:
Example:
{
"aggs" : {
"ip_ranges" : {
"ip_range" : {
"field" : "ip",
"ranges" : [
{ "to" : "10.0.0.5" },
{ "from" : "10.0.0.5" }
]
}
}
}
}
Response:
{
...
"aggregations": {
"ip_ranges": {
"buckets" : [
{
"to": 167772165,
"to_as_string": "10.0.0.5",
"doc_count": 4
},
{
"from": 167772165,
"from_as_string": "10.0.0.5",
"doc_count": 6
}
]
}
}
}
IP ranges can also be defined as CIDR masks:
{
"aggs" : {
"ip_ranges" : {
"ip_range" : {
"field" : "ip",
"ranges" : [
{ "mask" : "10.0.0.0/25" },
{ "mask" : "10.0.0.127/25" }
]
}
}
}
}
Response:
{
"aggregations": {
"ip_ranges": {
"buckets": [
{
"key": "10.0.0.0/25",
"from": 1.6777216E+8,
"from_as_string": "10.0.0.0",
"to": 167772287,
"to_as_string": "10.0.0.127",
"doc_count": 127
},
{
"key": "10.0.0.127/25",
"from": 1.6777216E+8,
"from_as_string": "10.0.0.0",
"to": 167772287,
"to_as_string": "10.0.0.127",
"doc_count": 127
}
]
}
}
}
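The numeric from/to values in the responses are simply the IPv4 addresses interpreted as 32-bit integers. Python's standard ipaddress module can reproduce them (an illustration of the numeric mapping, independent of Elasticsearch):

```python
import ipaddress

# "10.0.0.5" as a 32-bit integer: 10 * 2**24 + 5 = 167772165,
# matching the from/to values in the first response above
assert int(ipaddress.IPv4Address("10.0.0.5")) == 167772165

# a /25 mask covers 128 addresses; for 10.0.0.0/25 that is
# 10.0.0.0 .. 10.0.0.127
net = ipaddress.ip_network("10.0.0.0/25")
assert int(net.network_address) == 167772160
assert int(net.broadcast_address) == 167772287
```

Note that 10.0.0.127/25 describes a host inside that same /25 block, which is why both CIDR masks in the example above resolve to identical from/to bounds.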
49.11. Missing Aggregation
A field data based single bucket aggregation, that creates a bucket of all documents in the current document set context that are missing a field value (effectively, missing a field or having the configured NULL value set). This aggregator will often be used in conjunction with other field data bucket aggregators (such as ranges) to return information for all the documents that could not be placed in any of the other buckets due to missing field data values.
Example:
{
"aggs" : {
"products_without_a_price" : {
"missing" : { "field" : "price" }
}
}
}
In the above example, we get the total number of products that do not have a price.
Response:
{
...
"aggs" : {
"products_without_a_price" : {
"doc_count" : 10
}
}
}
49.12. Nested Aggregation
A special single bucket aggregation that enables aggregating nested documents.
For example, let's say we have an index of products, and each product holds the list of resellers - each having its own price for the product. The mapping could look like:
{
...
"product" : {
"properties" : {
"resellers" : {
"type" : "nested",
"properties" : {
"name" : { "type" : "string" },
"price" : { "type" : "double" }
}
}
}
}
}
The resellers field is an array that holds nested documents under the product object.
The following aggregation will return the minimum price for which the product can be purchased:
{
"query" : {
"match" : { "name" : "led tv" }
},
"aggs" : {
"resellers" : {
"nested" : {
"path" : "resellers"
},
"aggs" : {
"min_price" : { "min" : { "field" : "resellers.price" } }
}
}
}
}
As you can see above, the nested aggregation requires the path of the nested documents within the top level documents.
Then one can define any type of aggregation over these nested documents.
Response:
{
"aggregations": {
"resellers": {
"min_price": {
"value" : 350
}
}
}
}
49.13. Range Aggregation
A multi-bucket value source based aggregation that enables the user to define a set of ranges - each representing a bucket. During the aggregation process, the values extracted from each document will be checked against each bucket range and the relevant/matching documents will be "bucketed".
Note that this aggregation includes the from value and excludes the to value for each range.
Example:
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"ranges" : [
{ "to" : 50 },
{ "from" : 50, "to" : 100 },
{ "from" : 100 }
]
}
}
}
}
Response:
{
...
"aggregations": {
"price_ranges" : {
"buckets": [
{
"to": 50,
"doc_count": 2
},
{
"from": 50,
"to": 100,
"doc_count": 4
},
{
"from": 100,
"doc_count": 4
}
]
}
}
}
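The from-inclusive, to-exclusive matching rule noted above can be sketched in a few lines of Python (illustration only, not the actual implementation):

```python
def matching_ranges(value, ranges):
    """Return the ranges a value falls into, using from <= value < to."""
    return [
        r for r in ranges
        if r.get("from", float("-inf")) <= value < r.get("to", float("inf"))
    ]

ranges = [{"to": 50}, {"from": 50, "to": 100}, {"from": 100}]

# boundary values fall into the bucket whose "from" they equal
assert matching_ranges(50, ranges) == [{"from": 50, "to": 100}]
assert matching_ranges(100, ranges) == [{"from": 100}]
assert matching_ranges(49.99, ranges) == [{"to": 50}]
```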
49.13.1. Keyed Response
Setting the keyed flag to true will associate a unique string key with each bucket and return the ranges as a hash rather than an array:
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"keyed" : true,
"ranges" : [
{ "to" : 50 },
{ "from" : 50, "to" : 100 },
{ "from" : 100 }
]
}
}
}
}
Response:
{
...
"aggregations": {
"price_ranges" : {
"buckets": {
"*-50.0": {
"to": 50,
"doc_count": 2
},
"50.0-100.0": {
"from": 50,
"to": 100,
"doc_count": 4
},
"100.0-*": {
"from": 100,
"doc_count": 4
}
}
}
}
}
It is also possible to customize the key for each range:
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"keyed" : true,
"ranges" : [
{ "key" : "cheap", "to" : 50 },
{ "key" : "average", "from" : 50, "to" : 100 },
{ "key" : "expensive", "from" : 100 }
]
}
}
}
}
49.13.2. Script
{
"aggs" : {
"price_ranges" : {
"range" : {
"script" : "doc['price'].value",
"ranges" : [
{ "to" : 50 },
{ "from" : 50, "to" : 100 },
{ "from" : 100 }
]
}
}
}
}
This will interpret the script parameter as an inline script with the default script language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"price_ranges" : {
"range" : {
"script" : {
"file": "my_script",
"params": {
"field": "price"
}
},
"ranges" : [
{ "to" : 50 },
{ "from" : 50, "to" : 100 },
{ "from" : 100 }
]
}
}
}
}
For indexed scripts, replace the file parameter with an id parameter.
49.13.3. Value Script
Let's say the product prices are in USD but we would like to get the price ranges in EUR. We can use a value script to convert the prices prior to aggregation (assuming a conversion rate of 0.8):
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"script" : "_value * conversion_rate",
"params" : {
"conversion_rate" : 0.8
},
"ranges" : [
{ "to" : 35 },
{ "from" : 35, "to" : 70 },
{ "from" : 70 }
]
}
}
}
}
49.13.4. Sub Aggregations
The following example not only "buckets" the documents into the different price ranges but also computes statistics over the prices in each price range:
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"ranges" : [
{ "to" : 50 },
{ "from" : 50, "to" : 100 },
{ "from" : 100 }
]
},
"aggs" : {
"price_stats" : {
"stats" : { "field" : "price" }
}
}
}
}
}
Response:
{
"aggregations": {
"price_ranges" : {
"buckets": [
{
"to": 50,
"doc_count": 2,
"price_stats": {
"count": 2,
"min": 20,
"max": 47,
"avg": 33.5,
"sum": 67
}
},
{
"from": 50,
"to": 100,
"doc_count": 4,
"price_stats": {
"count": 4,
"min": 60,
"max": 98,
"avg": 82.5,
"sum": 330
}
},
{
"from": 100,
"doc_count": 4,
"price_stats": {
"count": 4,
"min": 134,
"max": 367,
"avg": 216,
"sum": 864
}
}
]
}
}
}
If a sub aggregation is also based on the same value source as the range aggregation (like the stats aggregation in the example above) it is possible to leave out the value source definition for it. The following will return the same response as above:
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"ranges" : [
{ "to" : 50 },
{ "from" : 50, "to" : 100 },
{ "from" : 100 }
]
},
"aggs" : {
"price_stats" : {
"stats" : {}
}
}
}
}
}
We don't need to specify the price field as we "inherit" it by default from the parent range aggregation.
49.14. Reverse nested Aggregation
A special single bucket aggregation that enables aggregating on parent docs from nested documents. Effectively this aggregation can break out of the nested block structure and link to other nested structures or the root document, which allows nesting other aggregations that aren’t part of the nested object in a nested aggregation.
The reverse_nested aggregation must be defined inside a nested aggregation.
-
path - Defines to what nested object field the aggregation should be joined back. The default is empty, which means that it joins back to the root / main document level. The path cannot contain a reference to a nested object field that falls outside the nested structure of the nested aggregation the reverse_nested is in.
For example, lets say we have an index for a ticket system with issues and comments. The comments are inlined into the issue documents as nested documents. The mapping could look like:
{
...
"issue" : {
"properties" : {
"tags" : { "type" : "string" }
"comments" : {
"type" : "nested"
"properties" : {
"username" : { "type" : "string", "index" : "not_analyzed" },
"comment" : { "type" : "string" }
}
}
}
}
}
The comments field is an array that holds nested documents under the issue object.
The following aggregation will return the usernames of the top commenters, and for each top commenter the top tags of the issues the user has commented on:
{
"query": {
"match": {
"name": "led tv"
}
},
"aggs": {
"comments": {
"nested": {
"path": "comments"
},
"aggs": {
"top_usernames": {
"terms": {
"field": "comments.username"
},
"aggs": {
"comment_to_issue": {
"reverse_nested": {},
"aggs": {
"top_tags_per_comment": {
"terms": {
"field": "tags"
}
}
}
}
}
}
}
}
}
}
As you can see above, the reverse_nested aggregation is put inside a nested aggregation as this is the only place
in the DSL where the reverse_nested aggregation can be used. Its sole purpose is to join back to a parent doc higher
up in the nested structure.
This reverse_nested aggregation joins back to the root / main document level, because no path has been defined.
Via the path option the reverse_nested aggregation can join back to a different level, if multiple layered nested
object types have been defined in the mapping.
Possible response snippet:
{
"aggregations": {
"comments": {
"top_usernames": {
"buckets": [
{
"key": "username_1",
"doc_count": 12,
"comment_to_issue": {
"top_tags_per_comment": {
"buckets": [
{
"key": "tag1",
"doc_count": 9
},
...
]
}
}
},
...
]
}
}
}
}
49.15. Sampler Aggregation
This functionality is experimental.
A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents. Optionally, diversity settings can be used to limit the number of matches that share a common value such as an "author". Example use cases:
-
Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
-
Removing bias from analytics by ensuring fair representation of content from different sources
-
Reducing the running cost of aggregations that can produce useful results using only samples e.g.
significant_terms
Example:
{
"query": {
"match": {
"text": "iphone"
}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 200,
"field" : "user.id"
},
"aggs": {
"keywords": {
"significant_terms": {
"field": "text"
}
}
}
}
}
}
Response:
{
...
"aggregations": {
"sample": {
"doc_count": 1000,
"keywords": {
"doc_count": 1000,
"buckets": [
...
{
"key": "bend",
"doc_count": 58,
"score": 37.982536582524276,
"bg_count": 103
},
...
]
}
}
}
}
1000 documents were sampled in total because we asked for a maximum of 200 from an index with 5 shards. The cost of performing the nested significant_terms aggregation was therefore limited rather than unbounded.
The results of the significant_terms aggregation are not skewed by any single over-active Twitter user because we asked for a maximum of one tweet from any one user in our sample.
49.15.1. shard_size
The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard.
The default value is 100.
49.15.2. Controlling diversity
Optionally, you can use the field or script and max_docs_per_value settings to control the maximum number of documents collected on any one shard which share a common value.
The choice of value (e.g. author) is loaded from a regular field or derived dynamically by a script.
The aggregation will throw an error if the choice of field or script produces multiple values for a document. It is currently not possible to offer this form of de-duplication using many values, primarily due to concerns over efficiency.
Any good market researcher will tell you that when working with samples of data it is important that the sample represents a healthy variety of opinions rather than being skewed by any single voice. The same is true with aggregations, and sampling with these diversify settings can offer a way to remove the bias in your content (an over-populated geography, a large spike in a timeline or an over-active forum spammer).
49.15.3. Field
Controlling diversity using a field:
{
"aggs" : {
"sample" : {
"sampler" : {
"field" : "author",
"max_docs_per_value" : 3
}
}
}
}
Note that the max_docs_per_value setting applies on a per-shard basis only for the purposes of shard-local sampling.
It is not intended as a way of providing a global de-duplication feature on search results.
49.15.4. Script
Controlling diversity using a script:
{
"aggs" : {
"sample" : {
"sampler" : {
"script" : "doc['author'].value + '/' + doc['genre'].value"
}
}
}
}
Note in the above example we chose to use the default max_docs_per_value setting of 1 and combine author and genre fields to ensure
each shard sample has, at most, one match for an author/genre pair.
49.15.5. execution_hint
When using the settings to control diversity, the optional execution_hint setting can influence the management of the values used for de-duplication.
Each option will hold up to shard_size values in memory while performing de-duplication but the type of value held can be controlled as follows:
-
hold field values directly (map)
-
hold ordinals of the field as determined by the Lucene index (global_ordinals)
-
hold hashes of the field values - with potential for hash collisions (bytes_hash)
The default setting is to use global_ordinals if this information is available from the Lucene index and to revert to map if not.
The bytes_hash setting may prove faster in some cases but introduces the possibility of false positives in de-duplication logic due to the possibility of hash collisions.
Please note that Elasticsearch will ignore the choice of execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints.
49.15.6. Limitations
Cannot be nested under breadth_first aggregations
Being a quality-based filter the sampler aggregation needs access to the relevance score produced for each document.
It therefore cannot be nested under a terms aggregation which has the collect_mode switched from the default depth_first mode to breadth_first as this discards scores.
In this situation an error will be thrown.
Limited de-dup logic.
The de-duplication logic in the diversify settings applies only at a shard level so will not apply across shards.
No specialized syntax for geo/date fields
Currently the syntax for defining the diversifying values is defined by a choice of field or script - there is no added syntactical sugar for expressing geo or date units such as "1w" (1 week).
This support may be added in a later release and users will currently have to create these sorts of values using a script.
49.16. Significant Terms Aggregation
An aggregation that returns interesting or unusual occurrences of terms in a set.
Note: the significant_terms aggregation can be very heavy when run on large indices. Work is in progress to provide more lightweight sampling techniques. As a result, the API for this feature may change in non-backwards compatible ways.
Example use cases:
-
Suggesting "H5N1" when users search for "bird flu" in text
-
Identifying the merchant that is the "common point of compromise" from the transaction history of credit card owners reporting loss
-
Suggesting keywords relating to stock symbol $ATI for an automated news classifier
-
Spotting the fraudulent doctor who is diagnosing more than his fair share of whiplash injuries
-
Spotting the tire manufacturer who has a disproportionate number of blow-outs
In all these cases the terms being selected are not simply the most popular terms in a set. They are the terms that have undergone a significant change in popularity measured between a foreground and background set. If the term "H5N1" only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user’s search results that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency.
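The "big swing in frequency" can be quantified with a simple foreground-vs-background ratio (a hedged illustration only; Elasticsearch's actual significance heuristics are more involved than a plain ratio):

```python
def frequency_swing(fg_hits, fg_size, bg_hits, bg_size):
    """How many times more frequent a term is in the foreground
    (search results) than in the background (whole index)."""
    fg_freq = fg_hits / fg_size   # e.g. 4 of 100 result docs
    bg_freq = bg_hits / bg_size   # e.g. 5 of 10,000,000 indexed docs
    return fg_freq / bg_freq

# "H5N1": 4/100 in the results vs 5/10,000,000 in the index
swing = frequency_swing(4, 100, 5, 10_000_000)
assert abs(swing - 80_000) < 1e-6  # an enormous uplift in frequency
```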
49.16.1. Single-set analysis
In the simplest case, the foreground set of interest is the search results matched by a query and the background set used for statistical comparisons is the index or indices from which the results were gathered.
Example:
{
"query" : {
"terms" : {"force" : [ "British Transport Police" ]}
},
"aggregations" : {
"significantCrimeTypes" : {
"significant_terms" : { "field" : "crime_type" }
}
}
}
Response:
{
...
"aggregations" : {
"significantCrimeTypes" : {
"doc_count": 47347,
"buckets" : [
{
"key": "Bicycle theft",
"doc_count": 3640,
"score": 0.371235374214817,
"bg_count": 66799
}
...
]
}
}
}
When querying an index of all crimes from all police forces, what these results show is that the British Transport Police force stands out as a force dealing with a disproportionately large number of bicycle thefts. Ordinarily, bicycle thefts represent only 1% of crimes (66799/5064554) but for the British Transport Police, who handle crime on railways and stations, 7% of crimes (3640/47347) are bike thefts. This is a significant seven-fold increase in frequency and so this anomaly was highlighted as the top crime type.
The problem with using a query to spot anomalies is it only gives us one subset to use for comparisons. To discover all the other police forces' anomalies we would have to repeat the query for each of the different forces.
This can be a tedious way to look for unusual patterns in an index.
49.16.2. Multi-set analysis
A simpler way to perform analysis across multiple categories is to use a parent-level aggregation to segment the data ready for analysis.
Example using a parent aggregation for segmentation:
{
"aggregations": {
"forces": {
"terms": {"field": "force"},
"aggregations": {
"significantCrimeTypes": {
"significant_terms": {"field": "crime_type"}
}
}
}
}
}
Response:
{
...
"aggregations": {
"forces": {
"buckets": [
{
"key": "Metropolitan Police Service",
"doc_count": 894038,
"significantCrimeTypes": {
"doc_count": 894038,
"buckets": [
{
"key": "Robbery",
"doc_count": 27617,
"score": 0.0599,
"bg_count": 53182
},
...
]
}
},
{
"key": "British Transport Police",
"doc_count": 47347,
"significantCrimeTypes": {
"doc_count": 47347,
"buckets": [
{
"key": "Bicycle theft",
"doc_count": 3640,
"score": 0.371,
"bg_count": 66799
},
...
]
}
}
]
}
}
}
Now we have anomaly detection for each of the police forces using a single request.
We can use other forms of top-level aggregations to segment our data, for example segmenting by geographic area to identify unusual hot-spots of a particular crime type:
{
"aggs": {
"hotspots": {
"geohash_grid" : {
"field":"location",
"precision":5
},
"aggs": {
"significantCrimeTypes": {
"significant_terms": {"field": "crime_type"}
}
}
}
}
}
This example uses the geohash_grid aggregation to create result buckets that represent geographic areas, and inside each
bucket we can identify anomalous levels of a crime type in these tightly-focused areas, e.g.:
-
Airports exhibit unusual numbers of weapon confiscations
-
Universities show uplifts of bicycle thefts
At a higher geohash_grid zoom-level with larger coverage areas we would start to see where an entire police-force may be tackling an unusual volume of a particular crime type.
A time-based top-level segmentation would, in the same way, help identify current trends for each point in time,
whereas a simple terms aggregation would typically show the very popular "constants" that persist across all time slots.
49.16.3. Use on free-text fields
The significant_terms aggregation can be used effectively on tokenized free-text fields to suggest:
-
keywords for refining end-user searches
-
keywords for use in percolator queries
Warning: Picking a free-text field as the subject of a significant terms analysis can be expensive! It will attempt to load every unique word into RAM. It is recommended to only use this on smaller indices.

Tip: Free-text significant_terms are much more easily understood when viewed in context. Take the results of significant_terms suggestions from a free-text field and use them in a terms query on the same field with a highlight clause to present users with example snippets of documents.
49.16.4. Custom background sets
Ordinarily, the foreground set of documents is "diffed" against a background set of all the documents in your index.
However, sometimes it may prove useful to use a narrower background set as the basis for comparisons.
For example, a query on documents relating to "Madrid" in an index with content from all over the world might reveal that "Spanish"
was a significant term. This may be true but if you want some more focused terms you could use a background_filter
on the term spain to establish a narrower set of documents as context. With this as a background "Spanish" would now
be seen as commonplace and therefore not as significant as words like "capital" that relate more strongly with Madrid.
Note that using a background filter will slow things down - each term’s background frequency must now be derived on-the-fly from filtering posting lists rather than reading the index’s pre-computed count for a term.
49.16.5. Limitations
Significant terms must be indexed values
Unlike the terms aggregation it is currently not possible to use script-generated terms for counting purposes. Because of the way the significant_terms aggregation must consider both foreground and background frequencies it would be prohibitively expensive to use a script on the entire index to obtain background frequencies for comparisons. Also DocValues are not supported as sources of term data for similar reasons.
No analysis of floating point fields
Floating point fields are currently not supported as the subject of significant_terms analysis. While integer or long fields can be used to represent concepts like bank account numbers or category numbers which can be interesting to track, floating point fields are usually used to represent quantities of something. As such, individual floating point terms are not useful for this form of frequency analysis.
Use as a parent aggregation
If there is the equivalent of a match_all query or no query criteria providing a subset of the index the significant_terms aggregation should not be used as the
top-most aggregation - in this scenario the foreground set is exactly the same as the background set and
so there is no difference in document frequencies to observe and from which to make sensible suggestions.
Another consideration is that the significant_terms aggregation produces many candidate results at shard level that are only later pruned on the reducing node once all statistics from all shards are merged. As a result, it can be inefficient and costly in terms of RAM to embed large child aggregations under a significant_terms aggregation that later discards many candidate terms. It is advisable in these cases to perform two searches - the first to provide a rationalized list of significant_terms and then add this shortlist of terms to a second query to go back and fetch the required child aggregations.
Approximate counts
The counts of how many documents contain a term, as provided in the results, are based on summing the samples returned from each shard and as such may be:
-
low if certain shards did not provide figures for a given term in their top sample
-
high when considering the background frequency as it may count occurrences found in deleted documents
Like most design decisions, this is the basis of a trade-off in which we have chosen to provide fast performance at the cost of some (typically small) inaccuracies.
However, the size and shard size settings covered in the next section provide tools to help control the accuracy levels.
49.16.6. Parameters
JLH score
The scores are derived from the doc frequencies in foreground and background sets. The absolute change in popularity (foregroundPercent - backgroundPercent) would favor common terms whereas the relative change in popularity (foregroundPercent/ backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
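The combination described above can be sketched in a few lines. This is an illustrative reading of the paragraph, not Elasticsearch's exact internal implementation (the guard for non-positive lift is an assumption):

```python
# Sketch of the JLH idea: absolute change in popularity multiplied by
# relative change in popularity.
def jlh_score(subset_freq, subset_size, superset_freq, superset_size):
    fg = subset_freq / subset_size        # foregroundPercent
    bg = superset_freq / superset_size    # backgroundPercent
    if bg == 0 or fg <= bg:
        return 0.0                        # assumption: only positive lifts score
    return (fg - bg) * (fg / bg)          # absolute change * relative change

# "Bicycle theft" for the British Transport Police example reproduces the
# score shown in the single-set response (~0.3712).
score = jlh_score(3640, 47347, 66799, 5064554)
```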
mutual information
Mutual information as described in "Information Retrieval", Manning et al., Chapter 13.5.1 can be used as significance score by adding the parameter
"mutual_information": {
"include_negatives": true
}
Mutual information does not differentiate between terms that are descriptive for the subset or for documents outside the subset. The significant terms therefore can contain terms that appear more or less frequently in the subset than outside the subset. To filter out the terms that appear less often in the subset than in documents outside the subset, include_negatives can be set to false.
By default, the assumption is that the documents in the bucket are also contained in the background. If instead you defined a custom background filter that represents a different set of documents that you want to compare to, set
"background_is_superset": false
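As a point of reference, the textbook mutual-information calculation from Manning et al., Chapter 13.5.1 can be sketched from a 2x2 contingency table. This is the standard formula, hedged: Elasticsearch's internal computation may differ in smoothing and sign handling:

```python
import math

# Mutual information from a 2x2 contingency table, assuming
# background_is_superset (subset documents are also in the superset).
def mutual_information(subset_freq, subset_size, superset_freq, superset_size):
    n11 = subset_freq                        # in subset, has term
    n10 = subset_size - subset_freq          # in subset, no term
    n01 = superset_freq - subset_freq        # outside subset, has term
    n00 = (superset_size - subset_size) - n01
    n = n11 + n10 + n01 + n00
    score = 0.0
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_tc > 0:
            score += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return score

# The H5N1-style example: 4 of 100 foreground docs vs 5 of 10M overall.
mi = mutual_information(4, 100, 5, 10_000_000)
```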
Chi square
Chi square as described in "Information Retrieval", Manning et al., Chapter 13.5.2 can be used as significance score by adding the parameter
"chi_square": {
}
Chi square behaves like mutual information and can be configured with the same parameters include_negatives and background_is_superset.
google normalized distance
Google normalized distance as described in "The Google Similarity Distance", Cilibrasi and Vitanyi, 2007 (http://arxiv.org/pdf/cs/0412098v3.pdf) can be used as significance score by adding the parameter
"gnd": {
}
gnd also accepts the background_is_superset parameter.
Percentage
A simple calculation of the number of documents in the foreground sample with a term divided by the number of documents in the background with the term. By default this produces a score greater than zero and less than one.
The benefit of this heuristic is that the scoring logic is simple to explain to anyone familiar with a "per capita" statistic. However, for fields with high cardinality there is a tendency for this heuristic to select the rarest terms such as typos that occur only once because they score 1/1 = 100%.
It would be hard for a seasoned boxer to win a championship if the prize was awarded purely on the basis of percentage of fights won - by these rules a newcomer with only one fight under his belt would be impossible to beat.
Multiple observations are typically required to reinforce a view so it is recommended in these cases to set both min_doc_count and shard_min_doc_count to a higher value such as 10 in order to filter out the low-frequency terms that otherwise take precedence.
"percentage": {
}
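The per-capita intuition, and its failure mode on rare terms, can be shown with two calls. A sketch only; the function name is illustrative:

```python
# Percentage heuristic: foreground docs with the term divided by background
# docs with the term. A one-off typo scores a perfect 1/1 = 100%.
def percentage_score(subset_freq, superset_freq):
    return subset_freq / superset_freq if superset_freq else 0.0

typo = percentage_score(1, 1)          # a term seen once, only in the foreground
bikes = percentage_score(3640, 66799)  # "Bicycle theft" from the example above
assert typo > bikes  # why min_doc_count / shard_min_doc_count matter here
```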
Which one is best?
Roughly, mutual_information prefers high-frequency terms even if they also occur frequently in the background. For example, in an analysis of natural-language text this might lead to the selection of stop words. mutual_information is unlikely to select very rare terms like misspellings. gnd prefers terms with a high co-occurrence and avoids the selection of stop words. It might be better suited for synonym detection. However, gnd has a tendency to select very rare terms that are, for example, a result of misspelling. chi_square and jlh are somewhat in-between.
It is hard to say which one of the different heuristics will be the best choice as it depends on what the significant terms are used for (see for example [Yang and Pedersen, "A Comparative Study on Feature Selection in Text Categorization", 1997](http://courses.ischool.berkeley.edu/i256/f06/papers/yang97comparative.pdf) for a study on using significant terms for feature selection for text classification).
If none of the above measures suits your use case, then another option is to implement a custom significance measure:
scripted
Customized scores can be implemented via a script:
"script_heuristic": {
"script": "_subset_freq/(_superset_freq - _subset_freq + 1)"
}
Scripts can be inline (as in the example above), indexed, or stored on disk. For details on the options, see the script documentation.
Available parameters in the script are
- _subset_freq: Number of documents the term appears in within the subset.
- _superset_freq: Number of documents the term appears in within the superset.
- _subset_size: Number of documents in the subset.
- _superset_size: Number of documents in the superset.
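The inline script shown earlier can be re-expressed in plain Python to sanity-check its behaviour outside Elasticsearch. A sketch; the function name is illustrative:

```python
# Replica of the script heuristic _subset_freq / (_superset_freq - _subset_freq + 1).
def script_heuristic(subset_freq, superset_freq):
    # The +1 keeps the denominator positive when the term occurs only in the subset.
    return subset_freq / (superset_freq - subset_freq + 1)

# H5N1-style numbers: 4 foreground hits out of 5 total occurrences.
score = script_heuristic(4, 5)
```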
Size & Shard Size
The size parameter can be set to define how many term buckets should be returned out of the overall terms list. By
default, the node coordinating the search process will request each shard to provide its own top term buckets
and once all shards respond, it will reduce the results to the final list that will then be returned to the client.
If the number of unique terms is greater than size, the returned list can be slightly off and not accurate
(it could be that the term counts are slightly off and it could even be that a term that should have been in the top
size buckets was not returned).
If set to 0, the size will be set to Integer.MAX_VALUE.
To ensure better accuracy a multiple of the final size is used as the number of terms to request from each shard
using a heuristic based on the number of shards. To take manual control of this setting the shard_size parameter
can be used to control the volumes of candidate terms produced by each shard.
Low-frequency terms can turn out to be the most interesting ones once all results are combined so the
significant_terms aggregation can produce higher-quality results when the shard_size parameter is set to
values significantly higher than the size setting. This ensures that a bigger volume of promising candidate terms are given
a consolidated review by the reducing node before the final selection. Obviously large candidate term lists
will cause extra network traffic and RAM usage so this is quality/cost trade off that needs to be balanced. If shard_size is set to -1 (the default) then shard_size will be automatically estimated based on the number of shards and the size parameter.
If set to 0, the shard_size will be set to Integer.MAX_VALUE.
Note: shard_size cannot be smaller than size (as it doesn’t make much sense). When it is, Elasticsearch will override it and reset it to be equal to size.
Minimum document count
It is possible to only return terms that match more than a configured number of hits using the min_doc_count option:
{
"aggs" : {
"tags" : {
"significant_terms" : {
"field" : "tag",
"min_doc_count": 10
}
}
}
}
The above aggregation would only return tags which have been found in 10 hits or more. Default value is 3.
Terms that score highly will be collected on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global term frequencies available. The decision whether a term is added to a candidate list depends only on the score computed on the shard using local shard frequencies, not the global frequencies of the word. The min_doc_count criterion is only applied after merging the local term statistics of all shards. In a way, the decision to add a term as a candidate is made without being very certain whether the term will actually reach the required min_doc_count. This might cause many (globally) high-frequency terms to be missing from the final result if low-frequency but high-scoring terms populated the candidate lists. To avoid this, the shard_size parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.
shard_min_doc_count parameter
The parameter shard_min_doc_count regulates the certainty a shard has about whether a term should actually be added to the candidate list with respect to the min_doc_count. Terms will only be considered if their local shard frequency within the set is higher than the shard_min_doc_count. If your dictionary contains many low-frequency words and you are not interested in these (for example misspellings), then you can set the shard_min_doc_count parameter to filter out candidate terms at the shard level that, with reasonable certainty, will not reach the required min_doc_count even after merging the local frequencies. shard_min_doc_count is set to 1 by default and has no effect unless you explicitly set it.
Note: Setting min_doc_count to 1 is generally not advised as it tends to return terms that are typos or other bizarre curiosities. Finding more than one instance of a term helps reinforce that, while still rare, the term was not the result of a one-off accident. The default value of 3 is used to provide a minimum weight-of-evidence. Setting shard_min_doc_count too high will cause significant candidate terms to be filtered out at the shard level; this value should be set much lower than min_doc_count/#shards.
Custom background context
The default source of statistical information for background term frequencies is the entire index and this
scope can be narrowed through the use of a background_filter to focus in on significant terms within a narrower
context:
{
"query" : {
"match" : { "text" : "madrid" }
},
"aggs" : {
"tags" : {
"significant_terms" : {
"field" : "tag",
"background_filter": {
"term" : { "text" : "spain"}
}
}
}
}
}
The above filter would help focus in on terms that were peculiar to the city of Madrid rather than revealing terms like "Spanish" that are unusual in the full index’s worldwide context but commonplace in the subset of documents containing the word "Spain".
Note: Use of background filters will slow the query as each term’s postings must be filtered to determine a frequency.
Filtering Values
It is possible (although rarely required) to filter the values for which buckets will be created. This can be done using the include and
exclude parameters which are based on a regular expression string or arrays of exact terms. This functionality mirrors the features
described in the terms aggregation documentation.
Execution hint
There are different mechanisms by which terms aggregations can be executed:
-
by using field values directly in order to aggregate data per-bucket (
map) -
by using ordinals of the field and preemptively allocating one bucket per ordinal value (
global_ordinals) -
by using ordinals of the field and dynamically allocating one bucket per ordinal value (
global_ordinals_hash)
Elasticsearch tries to have sensible defaults so this is something that generally doesn’t need to be configured.
map should only be considered when very few documents match a query. Otherwise the ordinals-based execution modes
are significantly faster. By default, map is only used when running an aggregation on scripts, since they don’t have
ordinals.
global_ordinals is the second fastest option, but the fact that it preemptively allocates buckets can be memory-intensive,
especially if you have one or more sub aggregations. It is used by default on top-level terms aggregations.
global_ordinals_hash, in contrast to global_ordinals, allocates buckets dynamically, so memory usage is linear in the number
of values of the documents that are part of the aggregation scope. It is used by default
in inner aggregations.
{
"aggs" : {
"tags" : {
"significant_terms" : {
"field" : "tags",
"execution_hint": "map"
}
}
}
}
The possible values are map, global_ordinals and global_ordinals_hash.
Please note that Elasticsearch will ignore this execution hint if it is not applicable.
49.17. Terms Aggregation
A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value.
Example:
{
"aggs" : {
"genders" : {
"terms" : { "field" : "gender" }
}
}
}
Response:
{
...
"aggregations" : {
"genders" : {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets" : [
{
"key" : "male",
"doc_count" : 10
},
{
"key" : "female",
"doc_count" : 10
}
]
}
}
}
- doc_count_error_upper_bound: an upper bound of the error on the document counts for each term, see below.
- sum_other_doc_count: when there are lots of unique terms, Elasticsearch only returns the top terms; this number is the sum of the document counts for all buckets that are not part of the response.
- buckets: the list of the top buckets, the meaning of top being defined by the order.
By default, the terms aggregation will return the buckets for the top ten terms ordered by the doc_count. One can
change this default behaviour by setting the size parameter.
49.17.1. Size
The size parameter can be set to define how many term buckets should be returned out of the overall terms list. By
default, the node coordinating the search process will request each shard to provide its own top size term buckets
and once all shards respond, it will reduce the results to the final list that will then be returned to the client.
This means that if the number of unique terms is greater than size, the returned list is slightly off and not accurate
(it could be that the term counts are slightly off and it could even be that a term that should have been in the top
size buckets was not returned). If set to 0, the size will be set to Integer.MAX_VALUE.
49.17.2. Document counts are approximate
As described above, the document counts (and the results of any sub aggregations) in the terms aggregation are not always accurate. This is because each shard provides its own view of what the ordered list of terms should be and these are combined to give a final view. Consider the following scenario:
A request is made to obtain the top 5 terms in the field product, ordered by descending document count from an index with 3 shards. In this case each shard is asked to give its top 5 terms.
{
"aggs" : {
"products" : {
"terms" : {
"field" : "product",
"size" : 5
}
}
}
}
The terms for each of the three shards are shown below with their respective document counts in brackets:
| Rank | Shard A | Shard B | Shard C |
|---|---|---|---|
| 1 | Product A (25) | Product A (30) | Product A (45) |
| 2 | Product B (18) | Product B (25) | Product C (44) |
| 3 | Product C (6) | Product F (17) | Product Z (36) |
| 4 | Product D (3) | Product Z (16) | Product G (30) |
| 5 | Product E (2) | Product G (15) | Product E (29) |
| 6 | Product F (2) | Product H (14) | Product H (28) |
| 7 | Product G (2) | Product I (10) | Product Q (2) |
| 8 | Product H (2) | Product Q (6) | Product D (1) |
| 9 | Product I (1) | Product J (8) | |
| 10 | Product J (1) | Product C (4) | |
The shards will return their top 5 terms so the results from the shards will be:
| Rank | Shard A | Shard B | Shard C |
|---|---|---|---|
| 1 | Product A (25) | Product A (30) | Product A (45) |
| 2 | Product B (18) | Product B (25) | Product C (44) |
| 3 | Product C (6) | Product F (17) | Product Z (36) |
| 4 | Product D (3) | Product Z (16) | Product G (30) |
| 5 | Product E (2) | Product G (15) | Product E (29) |
Taking the top 5 results from each of the shards (as requested) and combining them to make a final top 5 list produces the following:
| Rank | Merged result |
|---|---|
| 1 | Product A (100) |
| 2 | Product Z (52) |
| 3 | Product C (50) |
| 4 | Product G (45) |
| 5 | Product B (43) |
Because Product A was returned from all shards we know that its document count value is accurate. Product C was only returned by shards A and C so its document count is shown as 50 but this is not an accurate count. Product C exists on shard B, but its count of 4 was not high enough to put Product C into the top 5 list for that shard. Product Z was also returned only by 2 shards but the third shard does not contain the term. There is no way of knowing, at the point of combining the results to produce the final list of terms, that there is an error in the document count for Product C and not for Product Z. Product H has a document count of 44 across all 3 shards but was not included in the final list of terms because it did not make it into the top five terms on any of the shards.
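The merge above can be simulated in a few lines. This is a sketch of the scenario, not Elasticsearch's implementation; the shard data is hard-coded from the tables, with products abbreviated to single letters:

```python
from collections import Counter

# Per-shard term counts from the three-shard example above.
shards = [
    {"A": 25, "B": 18, "C": 6, "D": 3, "E": 2, "F": 2, "G": 2, "H": 2, "I": 1, "J": 1},
    {"A": 30, "B": 25, "F": 17, "Z": 16, "G": 15, "H": 14, "I": 10, "J": 8, "Q": 6, "C": 4},
    {"A": 45, "C": 44, "Z": 36, "G": 30, "E": 29, "H": 28, "Q": 2, "D": 1},
]

def merged_top(shards, size):
    """Each shard contributes only its local top `size` terms; the
    coordinating node sums what it received and keeps the overall top `size`."""
    combined = Counter()
    for shard in shards:
        for term, count in sorted(shard.items(), key=lambda kv: -kv[1])[:size]:
            combined[term] += count
    return combined.most_common(size)

# Reproduces the final list: A (100), Z (52), C (50), G (45), B (43) --
# with C undercounted (actual 54) and H (actual 44) missing entirely.
top5 = merged_top(shards, 5)
```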
49.17.3. Shard Size
The higher the requested size is, the more accurate the results will be, but also, the more expensive it will be to
compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data
transfers between the nodes and the client).
The shard_size parameter can be used to minimize the extra work that comes with bigger requested size. When defined,
it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the
coordinating node will then reduce them to a final result which will be based on the size parameter - this way,
one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to
the client. If set to 0, the shard_size will be set to Integer.MAX_VALUE.
Note: shard_size cannot be smaller than size (as it doesn’t make much sense). When it is, Elasticsearch will override it and reset it to be equal to size.
It is possible to not limit the number of terms that are returned by setting size to 0. Don’t use this
on high-cardinality fields as this will kill both your CPU, since terms need to be returned sorted, and your network.
The default shard_size is a multiple of the size parameter which is dependent on the number of shards.
49.17.4. Calculating Document Count Error
There are two error values which can be shown on the terms aggregation. The first gives a value for the aggregation as a whole which represents the maximum potential document count for a term which did not make it into the final list of terms. This is calculated as the sum of the document count from the last term returned from each shard. For the example given above the value would be 46 (2 + 15 + 29). This means that in the worst case scenario a term which was not returned could have the 4th highest document count.
{
...
"aggregations" : {
"products" : {
"doc_count_error_upper_bound" : 46,
"buckets" : [
{
"key" : "Product A",
"doc_count" : 100
},
{
"key" : "Product Z",
"doc_count" : 52
},
...
]
}
}
}
49.17.5. Per bucket document count error
Warning: This functionality is experimental and may change in future releases.
The second error value can be enabled by setting the show_term_doc_count_error parameter to true. This shows an error value
for each term returned by the aggregation which represents the worst case error in the document count and can be useful when
deciding on a value for the shard_size parameter. This is calculated by summing the document counts for the last term returned
by all shards which did not return the term. In the example above the error in the document count for Product C would be 15 as
Shard B was the only shard not to return the term and the document count of the last term it did return was 15. The actual document
count of Product C was 54, so the document count was only actually off by 4 even though the worst case was that it would be off by
15. Product A, however, has an error of 0 for its document count; since every shard returned it, we can be confident that the count
returned is accurate.
{
...
"aggregations" : {
"products" : {
"doc_count_error_upper_bound" : 46,
"buckets" : [
{
"key" : "Product A",
"doc_count" : 100,
"doc_count_error_upper_bound" : 0
},
{
"key" : "Product Z",
"doc_count" : 52,
"doc_count_error_upper_bound" : 2
},
...
]
}
}
}
These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is ordered by the terms values themselves (either ascending or descending) there is no error in the document count since if a shard does not return a particular term which appears in the results from another shard, it must not have that term in its index. When the aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be determined and is given a value of -1 to indicate this.
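Both error calculations can be sketched against the same three-shard example. The helper names here are illustrative, not part of any API:

```python
def error_bounds(shard_top_lists):
    """shard_top_lists: per shard, a [(term, count), ...] list in descending count order."""
    last_counts = [top[-1][1] for top in shard_top_lists]
    # Whole-aggregation doc_count_error_upper_bound: sum of each shard's
    # lowest returned count (what an unseen term could have contributed).
    agg_error = sum(last_counts)
    returned = [{term for term, _ in top} for top in shard_top_lists]
    def term_error(term):
        # Sum the last returned count of every shard that did NOT return the term.
        return sum(last for seen, last in zip(returned, last_counts) if term not in seen)
    return agg_error, term_error

tops = [
    [("A", 25), ("B", 18), ("C", 6), ("D", 3), ("E", 2)],    # shard A's top 5
    [("A", 30), ("B", 25), ("F", 17), ("Z", 16), ("G", 15)], # shard B's top 5
    [("A", 45), ("C", 44), ("Z", 36), ("G", 30), ("E", 29)], # shard C's top 5
]
# agg_error is 46 (2 + 15 + 29); Product C's per-bucket error is 15
# (only shard B omitted it), Product A's is 0, Product Z's is 2.
agg_error, term_error = error_bounds(tops)
```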
49.17.6. Order
The order of the buckets can be customized by setting the order parameter. By default, the buckets are ordered by
their doc_count descending. It is also possible to change this behaviour as follows:
Ordering the buckets by their doc_count in an ascending manner:
{
"aggs" : {
"genders" : {
"terms" : {
"field" : "gender",
"order" : { "_count" : "asc" }
}
}
}
}
Ordering the buckets alphabetically by their terms in an ascending manner:
{
"aggs" : {
"genders" : {
"terms" : {
"field" : "gender",
"order" : { "_term" : "asc" }
}
}
}
}
Ordering the buckets by single value metrics sub-aggregation (identified by the aggregation name):
{
"aggs" : {
"genders" : {
"terms" : {
"field" : "gender",
"order" : { "avg_height" : "desc" }
},
"aggs" : {
"avg_height" : { "avg" : { "field" : "height" } }
}
}
}
}
Ordering the buckets by multi value metrics sub-aggregation (identified by the aggregation name):
{
"aggs" : {
"genders" : {
"terms" : {
"field" : "gender",
"order" : { "height_stats.avg" : "desc" }
},
"aggs" : {
"height_stats" : { "stats" : { "field" : "height" } }
}
}
}
}
It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long
as the aggregations path are of a single-bucket type, where the last aggregation in the path may either be a single-bucket
one or a metrics one. If it’s a single-bucket type, the order will be defined by the number of docs in the bucket (i.e. doc_count),
in case it’s a metrics one, the same rules as above apply (where the path must indicate the metric name to sort by in case of
a multi-value metrics aggregation, and in case of a single-value metrics aggregation the sort will be applied on that value).
The path must be defined in the following form:
AGG_SEPARATOR    := '>'
METRIC_SEPARATOR := '.'
AGG_NAME         := <the name of the aggregation>
METRIC           := <the name of the metric (in case of multi-value metrics aggregation)>
PATH             := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>]
{
"aggs" : {
"countries" : {
"terms" : {
"field" : "address.country",
"order" : { "females>height_stats.avg" : "desc" }
},
"aggs" : {
"females" : {
"filter" : { "term" : { "gender" : "female" }},
"aggs" : {
"height_stats" : { "stats" : { "field" : "height" }}
}
}
}
}
}
}
The above will sort the countries buckets based on the average height among the female population.
Multiple criteria can be used to order the buckets by providing an array of order criteria such as the following:
{
"aggs" : {
"countries" : {
"terms" : {
"field" : "address.country",
"order" : [ { "females>height_stats.avg" : "desc" }, { "_count" : "desc" } ]
},
"aggs" : {
"females" : {
"filter" : { "term" : { "gender" : "female" }},
"aggs" : {
"height_stats" : { "stats" : { "field" : "height" }}
}
}
}
}
}
}
The above will sort the countries buckets based on the average height among the female population and then by
their doc_count in descending order.
Note: In the event that two buckets share the same values for all order criteria, the bucket’s term value is used as a tie-breaker in ascending alphabetical order to prevent non-deterministic ordering of buckets.
49.17.7. Minimum document count
It is possible to only return terms that match more than a configured number of hits using the min_doc_count option:
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"min_doc_count": 10
}
}
}
}
The above aggregation would only return tags which have been found in 10 hits or more. Default value is 1.
Terms are collected and ordered on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global document count available. The decision whether a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The min_doc_count criterion is only applied after merging the local term statistics of all shards. In a way, the decision to add a term as a candidate is made without being very certain whether the term will actually reach the required min_doc_count. This might cause many (globally) high-frequency terms to be missing from the final result if low-frequency terms populated the candidate lists. To avoid this, the shard_size parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.
shard_min_doc_count parameter
The parameter shard_min_doc_count regulates the certainty a shard has about whether a term should actually be added to the candidate list with respect to the min_doc_count. Terms will only be considered if their local shard frequency within the set is higher than the shard_min_doc_count. If your dictionary contains many low-frequency terms and you are not interested in those (for example misspellings), then you can set the shard_min_doc_count parameter to filter out candidate terms at the shard level that, with reasonable certainty, will not reach the required min_doc_count even after merging the local counts. shard_min_doc_count is set to 0 by default and has no effect unless you explicitly set it.
Note: Setting min_doc_count=0 will also return buckets for terms that didn’t match any hit. However, some of the returned terms which have a document count of zero might only belong to deleted documents or documents from other types, so there is no guarantee that a match_all query would find a positive document count for those terms.

Note: When NOT sorting on doc_count descending, high values of min_doc_count may return a number of buckets which is less than size because not enough data was gathered from the shards. Missing buckets can be brought back by increasing shard_size. Setting shard_min_doc_count too high will cause terms to be filtered out at the shard level; this value should be set much lower than min_doc_count/#shards.
49.17.8. Script
Generating the terms using a script:
{
"aggs" : {
"genders" : {
"terms" : {
"script" : "doc['gender'].value"
}
}
}
}
This will interpret the script parameter as an inline script with the default script language and no script parameters. To use a file script use the following syntax:
{
"aggs" : {
"genders" : {
"terms" : {
"script" : {
"file": "my_script",
"params": {
"field": "gender"
}
}
}
}
}
}
Note: for indexed scripts, replace the file parameter with an id parameter.
49.17.9. Value Script
{
"aggs" : {
"genders" : {
"terms" : {
"field" : "gender",
"script" : "'Gender: ' +_value"
}
}
}
}
49.17.10. Filtering Values
It is possible to filter the values for which buckets will be created. This can be done using the include and
exclude parameters which are based on regular expression strings or arrays of exact values.
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"include" : ".*sport.*",
"exclude" : "water_.*"
}
}
}
}
In the above example, buckets will be created for all the tags that have the word sport in them, except those starting
with water_ (so the tag water_sports will not be aggregated). The include regular expression determines which
values are "allowed" to be aggregated, while the exclude determines the values that should not be aggregated. When
both are defined, the exclude takes precedence, meaning the include is evaluated first and only then the exclude.
The syntax is the same as regexp queries.
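The include/exclude semantics can be mimicked with ordinary regular expressions (a sketch of the evaluation order, not of Elasticsearch internals; the tag values are made up):

```python
import re

include = re.compile(".*sport.*")
exclude = re.compile("water_.*")
tags = ["sports", "water_sports", "sport_news", "weather"]

# include is evaluated first; exclude then takes precedence.
kept = [t for t in tags
        if include.fullmatch(t) and not exclude.fullmatch(t)]
assert kept == ["sports", "sport_news"]
```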
For matching based on exact values the include and exclude parameters can simply take an array of
strings that represent the terms as they are found in the index:
{
"aggs" : {
"JapaneseCars" : {
"terms" : {
"field" : "make",
"include" : ["mazda", "honda"]
}
},
"ActiveCarManufacturers" : {
"terms" : {
"field" : "make",
"exclude" : ["rover", "jensen"]
}
}
}
}
49.17.11. Multi-field terms aggregation
The terms aggregation does not support collecting terms from multiple fields
in the same document. The reason is that the terms agg doesn’t collect the
string term values themselves, but rather uses
global ordinals
to produce a list of all of the unique values in the field. Global ordinals
results in an important performance boost which would not be possible across
multiple fields.
There are two approaches that you can use to perform a terms agg across
multiple fields:
- Script
Use a script to retrieve terms from multiple fields. This disables the global ordinals optimization and will be slower than collecting terms from a single field, but it gives you the flexibility to implement this option at search time.
- copy_to field
If you know ahead of time that you want to collect the terms from two or more fields, then use copy_to in your mapping to create a new dedicated field at index time which contains the values from both fields. You can aggregate on this single field, which will benefit from the global ordinals optimization.
49.17.12. Collect mode
Deferring calculation of child aggregations
For fields with many unique terms and a small number of required results it can be more efficient to delay the calculation of child aggregations until the top parent-level aggs have been pruned. Ordinarily, all branches of the aggregation tree are expanded in one depth-first pass and only then any pruning occurs. In some rare scenarios this can be very wasteful and can hit memory constraints. An example problem scenario is querying a movie database for the 10 most popular actors and their 5 most common co-stars:
{
"aggs" : {
"actors" : {
"terms" : {
"field" : "actors",
"size" : 10
},
"aggs" : {
"costars" : {
"terms" : {
"field" : "actors",
"size" : 5
}
}
}
}
}
}
Even though the number of movies may be comparatively small and we want only 50 result buckets there is a combinatorial explosion of buckets
during calculation - a single movie will produce n² buckets where n is the number of actors. The sane option would be to first determine
the 10 most popular actors and only then examine the top co-stars for these 10 actors. This alternative strategy is what we call the breadth_first collection
mode as opposed to the default depth_first mode:
{
"aggs" : {
"actors" : {
"terms" : {
"field" : "actors",
"size" : 10,
"collect_mode" : "breadth_first"
},
"aggs" : {
"costars" : {
"terms" : {
"field" : "actors",
"size" : 5
}
}
}
}
}
}
When using breadth_first mode the set of documents that fall into the uppermost buckets are
cached for subsequent replay so there is a memory overhead in doing this which is linear with the number of matching documents.
In most requests the volume of buckets generated is smaller than the number of documents that fall into them so the default depth_first
collection mode is normally the best bet but occasionally the breadth_first strategy can be significantly more efficient. Currently
elasticsearch will always use the depth_first collect_mode unless explicitly instructed to use breadth_first as in the above example.
Note that the order parameter can still be used to refer to data from a child aggregation when using the breadth_first setting - the parent
aggregation understands that this child aggregation will need to be called first before any of the other child aggregations.
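The scale of the combinatorial explosion can be sketched with some back-of-the-envelope arithmetic (the actor count is a hypothetical figure, chosen only for illustration):

```python
distinct_actors = 50_000  # hypothetical catalogue size

# depth_first expands every (actor, co-star) bucket before pruning,
# so the worst case is quadratic in the number of distinct actors.
depth_first_worst_case = distinct_actors ** 2

# breadth_first prunes to the 10 most popular actors first and only
# then builds their top-5 co-star buckets.
breadth_first_result_buckets = 10 * 5

assert depth_first_worst_case == 2_500_000_000
assert breadth_first_result_buckets == 50
```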
Note: It is not possible to nest aggregations such as top_hits, which require access to match score information, under an aggregation that uses the breadth_first collection mode, because this would require a RAM buffer to hold the float score value for every document, which would typically be too costly in terms of RAM.
49.17.13. Execution hint
Note: The automated execution optimization is experimental, so this parameter is provided temporarily as a way to override the default behaviour.
There are different mechanisms by which terms aggregations can be executed:
- map: by using field values directly in order to aggregate data per-bucket
- global_ordinals: by using ordinals of the field and preemptively allocating one bucket per ordinal value
- global_ordinals_hash: by using ordinals of the field and dynamically allocating one bucket per ordinal value
- global_ordinals_low_cardinality: by using per-segment ordinals to compute counts and remap these counts to global counts using global ordinals
Elasticsearch tries to have sensible defaults so this is something that generally doesn’t need to be configured.
map should only be considered when very few documents match a query. Otherwise the ordinals-based execution modes
are significantly faster. By default, map is only used when running an aggregation on scripts, since they don’t have
ordinals.
global_ordinals_low_cardinality only works for leaf terms aggregations but is usually the fastest execution mode. Memory
usage is linear with the number of unique values in the field, so it is only enabled by default on low-cardinality fields.
global_ordinals is the second fastest option, but the fact that it preemptively allocates buckets can be memory-intensive,
especially if you have one or more sub aggregations. It is used by default on top-level terms aggregations.
global_ordinals_hash, in contrast to global_ordinals and global_ordinals_low_cardinality, allocates buckets dynamically,
so memory usage is linear in the number of values of the documents that are part of the aggregation scope. It is used by default
in inner aggregations.
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"execution_hint": "map"
}
}
}
}
The possible values are map, global_ordinals, global_ordinals_hash and global_ordinals_low_cardinality.
Please note that Elasticsearch will ignore this execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints.
49.17.14. Missing value
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"missing": "N/A"
}
}
}
}
Documents without a value in the tags field will fall into the same bucket as documents that have the value N/A.
50. Pipeline Aggregations
This functionality is experimental.
Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree. There are many different types of pipeline aggregation, each computing different information from other aggregations, but these types can be broken down into two families:
- Parent
-
A family of pipeline aggregations that is provided with the output of its parent aggregation and is able to compute new buckets or new aggregations to add to existing buckets.
- Sibling
-
Pipeline aggregations that are provided with the output of a sibling aggregation and are able to compute a new aggregation which will be at the same level as the sibling aggregation.
Pipeline aggregations can reference the aggregations they need to perform their computation by using the buckets_path
parameter to indicate the paths to the required metrics. The syntax for defining these paths can be found in the
buckets_path Syntax section below.
Pipeline aggregations cannot have sub-aggregations but, depending on the type, they can reference another pipeline in the buckets_path,
allowing pipeline aggregations to be chained. For example, you can chain together two derivatives to calculate the second derivative
(i.e. a derivative of a derivative).
Note: Because pipeline aggregations only add to the output, when chaining pipeline aggregations the output of each pipeline aggregation will be included in the final output.
buckets_path Syntax
Most pipeline aggregations require another aggregation as their input. The input aggregation is defined via the buckets_path
parameter, which follows a specific format:
AGG_SEPARATOR    := '>'
METRIC_SEPARATOR := '.'
AGG_NAME         := <the name of the aggregation>
METRIC           := <the name of the metric (in case of multi-value metrics aggregation)>
PATH             := <AGG_NAME>[<AGG_SEPARATOR><AGG_NAME>]*[<METRIC_SEPARATOR><METRIC>]
For example, the path "my_bucket>my_stats.avg" points to the avg value in the "my_stats" metric, which is
contained in the "my_bucket" bucket aggregation.
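The path grammar can be illustrated with a small parser (a sketch for clarity, not the actual Elasticsearch implementation):

```python
def parse_buckets_path(path):
    # Aggregation names are separated by '>'; a final '.' separates
    # an optional metric name (for multi-value metrics aggregations).
    *parents, last = path.split(">")
    if "." in last:
        agg, metric = last.rsplit(".", 1)
    else:
        agg, metric = last, None
    return parents + [agg], metric

assert parse_buckets_path("my_bucket>my_stats.avg") == (["my_bucket", "my_stats"], "avg")
assert parse_buckets_path("the_sum") == (["the_sum"], None)
```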
Paths are relative from the position of the pipeline aggregation; they are not absolute paths, and the path cannot go back "up" the
aggregation tree. For example, this moving average is embedded inside a date_histogram and refers to a "sibling"
metric "the_sum":
{
"my_date_histo":{
"date_histogram":{
"field":"timestamp",
"interval":"day"
},
"aggs":{
"the_sum":{
"sum":{ "field": "lemmings" }
},
"the_movavg":{
"moving_avg":{ "buckets_path": "the_sum" }
}
}
}
}
The metric is called "the_sum"; the buckets_path refers to it via the relative path "the_sum".
buckets_path is also used for Sibling pipeline aggregations, where the aggregation is "next" to a series of buckets
instead of embedded "inside" them. For example, the max_bucket aggregation uses the buckets_path to specify
a metric embedded inside a sibling aggregation:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"max_monthly_sales": {
"max_bucket": {
"buckets_path": "sales_per_month>sales"
}
}
}
}
buckets_path instructs this max_bucket aggregation that we want the maximum value of the sales aggregation in the
sales_per_month date histogram.
Special Paths
Instead of pathing to a metric, buckets_path can use a special "_count" path. This instructs
the pipeline aggregation to use the document count as its input. For example, a moving average can be calculated on the document
count of each bucket, instead of a specific metric:
{
"my_date_histo":{
"date_histogram":{
"field":"timestamp",
"interval":"day"
},
"aggs":{
"the_movavg":{
"moving_avg":{ "buckets_path": "_count" }
}
}
}
}
By using _count instead of a metric name, we can calculate the moving average of document counts in the histogram.
Dealing with dots in agg names
An alternate syntax is supported to cope with aggregations or metrics which
have dots in the name, such as the 99.9th
percentile. This metric
may be referred to as:
"buckets_path": "my_percentile[99.9]"
Dealing with gaps in the data
Data in the real world is often noisy and sometimes contains gaps — places where data simply doesn’t exist. This can occur for a variety of reasons, the most common being:
-
Documents falling into a bucket do not contain a required field
-
There are no documents matching the query for one or more buckets
-
The metric being calculated is unable to generate a value, likely because another dependent bucket is missing a value. Some pipeline aggregations have specific requirements that must be met (e.g. a derivative cannot calculate a metric for the first value because there is no previous value, a HoltWinters moving average needs "warmup" data to begin calculating, etc.)
Gap policies are a mechanism to inform the pipeline aggregation about the desired behavior when "gappy" or missing
data is encountered. All pipeline aggregations accept the gap_policy parameter. There are currently two gap policies
to choose from:
- skip
-
This option treats missing data as if the bucket does not exist. It will skip the bucket and continue calculating using the next available value.
- insert_zeros
-
This option will replace missing values with a zero (
0) and pipeline aggregation computation will proceed as normal.
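The two policies can be sketched in a few lines (an illustration of the behaviour, not the actual implementation; None stands in for a gap):

```python
def apply_gap_policy(series, policy):
    if policy == "insert_zeros":
        # Replace each gap with 0 and keep the series length.
        return [0 if v is None else v for v in series]
    if policy == "skip":
        # Treat gappy buckets as if they did not exist.
        return [v for v in series if v is not None]
    raise ValueError(policy)

data = [550, None, 375]
assert apply_gap_policy(data, "insert_zeros") == [550, 0, 375]
assert apply_gap_policy(data, "skip") == [550, 375]
```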
50.1. Avg Bucket Aggregation
This functionality is experimental.
A sibling pipeline aggregation which calculates the (mean) average value of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
50.1.1. Syntax
An avg_bucket aggregation looks like this in isolation:
{
"avg_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the average for (see buckets_path Syntax) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional, defaults to skip |
format | format to apply to the output value of this aggregation | Optional, defaults to null |
The following snippet calculates the average of the total monthly sales:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"avg_monthly_sales": {
"avg_bucket": {
"buckets_path": "sales_per_month>sales"
}
}
}
}
buckets_path instructs this avg_bucket aggregation that we want the (mean) average value of the sales aggregation in the
sales_per_month date histogram.
And the following may be the response:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
}
}
]
},
"avg_monthly_sales": {
"value": 328.33333333333333
}
}
}
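The reported avg_monthly_sales value can be reproduced by hand from the three monthly sums:

```python
monthly_sales = [550, 60, 375]
avg = sum(monthly_sales) / len(monthly_sales)
# 985 / 3 = 328.333...
assert abs(avg - 328.33333333333333) < 1e-9
```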
50.2. Derivative Aggregation
This functionality is experimental.
A parent pipeline aggregation which calculates the derivative of a specified metric in a parent histogram (or date_histogram)
aggregation. The specified metric must be numeric and the enclosing histogram must have min_doc_count set to 0 (default
for histogram aggregations).
50.2.1. Syntax
A derivative aggregation looks like this in isolation:
{
"derivative": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the derivative for (see buckets_path Syntax) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional, defaults to skip |
format | format to apply to the output value of this aggregation | Optional, defaults to null |
50.2.2. First Order Derivative
The following snippet calculates the derivative of the total monthly sales:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
},
"sales_deriv": {
"derivative": {
"buckets_path": "sales"
}
}
}
}
}
}
buckets_path instructs this derivative aggregation to use the output of the sales aggregation for the derivative.
And the following may be the response:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
},
"sales_deriv": {
"value": -490
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
},
"sales_deriv": {
"value": 315
}
}
]
}
}
}
No derivative for the first bucket, since we need at least 2 data points to calculate the derivative.
Derivative value units are implicitly defined by the sales aggregation and the parent histogram, so in this case the units
would be $/month, assuming the price field has units of $.
The number of documents in the bucket is represented by the doc_count field.
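The first-order derivative reported above is simply the difference between consecutive bucket values, which can be sketched as:

```python
def first_derivative(values):
    # No derivative for the first bucket: there is no previous value.
    return [None] + [b - a for a, b in zip(values, values[1:])]

# Monthly sales sums from the example response.
assert first_derivative([550, 60, 375]) == [None, -490, 315]
```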
50.2.3. Second Order Derivative
A second order derivative can be calculated by chaining the derivative pipeline aggregation onto the result of another derivative pipeline aggregation as in the following example which will calculate both the first and the second order derivative of the total monthly sales:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
},
"sales_deriv": {
"derivative": {
"buckets_path": "sales"
}
},
"sales_2nd_deriv": {
"derivative": {
"buckets_path": "sales_deriv"
}
}
}
}
}
}
buckets_path for the second derivative points to the name of the first derivative.
And the following may be the response:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
},
"sales_deriv": {
"value": -490
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
},
"sales_deriv": {
"value": 315
},
"sales_2nd_deriv": {
"value": 805
}
}
]
}
}
}
No second derivative for the first two buckets, since we need at least 2 data points from the first derivative to calculate the second derivative.
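The second-order value can be checked by differencing the first derivative once more:

```python
sales = [550, 60, 375]
sales_deriv = [None, sales[1] - sales[0], sales[2] - sales[1]]   # [None, -490, 315]
# The second derivative needs two first-derivative points, so the
# first two buckets have no value.
sales_2nd_deriv = [None, None, sales_deriv[2] - sales_deriv[1]]
assert sales_2nd_deriv == [None, None, 805]
```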
50.2.4. Units
The derivative aggregation allows the units of the derivative values to be specified. This returns an extra field in the response
normalized_value which reports the derivative value in the desired x-axis units. In the below example we calculate the derivative
of the total sales per month but ask for the derivative of the sales as in the units of sales per day:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
},
"sales_deriv": {
"derivative": {
"buckets_path": "sales",
"unit": "day"
}
}
}
}
}
}
unit specifies what unit to use for the x-axis of the derivative calculation.
And the following may be the response:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
},
"sales_deriv": {
"value": -490,
"normalized_value": -17.5
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
},
"sales_deriv": {
"value": 315,
"normalized_value": 10.16129032258065
}
}
]
}
}
}
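The normalized_value figures in this response can be reproduced by dividing each derivative by the length of its bucket in days (28 days for February 2015, 31 for March):

```python
# February bucket: -490 $/month over a 28-day month.
assert -490 / 28 == -17.5
# March bucket: 315 $/month over a 31-day month.
assert abs(315 / 31 - 10.16129032258065) < 1e-9
```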
value is reported in the original units of per month.
normalized_value is reported in the desired units of per day.

Max Bucket Aggregation
This functionality is experimental.
A sibling pipeline aggregation which identifies the bucket(s) with the maximum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
50.2.5. Syntax
A max_bucket aggregation looks like this in isolation:
{
"max_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the maximum for (see buckets_path Syntax) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional, defaults to skip |
format | format to apply to the output value of this aggregation | Optional, defaults to null |
The following snippet calculates the maximum of the total monthly sales:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"max_monthly_sales": {
"max_bucket": {
"buckets_path": "sales_per_month>sales"
}
}
}
}
buckets_path instructs this max_bucket aggregation that we want the maximum value of the sales aggregation in the
sales_per_month date histogram.
And the following may be the response:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
}
}
]
},
"max_monthly_sales": {
"keys": ["2015/01/01 00:00:00"],
"value": 550
}
}
}
keys is an array of strings since the maximum value may be present in multiple buckets.
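The keys array can be mimicked by collecting every bucket whose metric equals the maximum, as in this sketch:

```python
sales = {
    "2015/01/01 00:00:00": 550,
    "2015/02/01 00:00:00": 60,
    "2015/03/01 00:00:00": 375,
}
top = max(sales.values())
# Keep every bucket key that reaches the maximum (ties produce several keys).
keys = [k for k, v in sales.items() if v == top]
assert (keys, top) == (["2015/01/01 00:00:00"], 550)
```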
50.3. Min Bucket Aggregation
This functionality is experimental.
A sibling pipeline aggregation which identifies the bucket(s) with the minimum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
50.3.1. Syntax
A min_bucket aggregation looks like this in isolation:
{
"min_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the minimum for (see buckets_path Syntax) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional, defaults to skip |
format | format to apply to the output value of this aggregation | Optional, defaults to null |
The following snippet calculates the minimum of the total monthly sales:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"min_monthly_sales": {
"min_bucket": {
"buckets_path": "sales_per_month>sales"
}
}
}
}
buckets_path instructs this min_bucket aggregation that we want the minimum value of the sales aggregation in the
sales_per_month date histogram.
And the following may be the response:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
}
}
]
},
"min_monthly_sales": {
"keys": ["2015/02/01 00:00:00"],
"value": 60
}
}
}
keys is an array of strings since the minimum value may be present in multiple buckets.
50.4. Sum Bucket Aggregation
This functionality is experimental.
A sibling pipeline aggregation which calculates the sum across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
50.4.1. Syntax
A sum_bucket aggregation looks like this in isolation:
{
"sum_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the sum for (see buckets_path Syntax) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional, defaults to skip |
format | format to apply to the output value of this aggregation | Optional, defaults to null |
The following snippet calculates the sum of all the total monthly sales buckets:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"sum_monthly_sales": {
"sum_bucket": {
"buckets_path": "sales_per_month>sales"
}
}
}
}
buckets_path instructs this sum_bucket aggregation that we want the sum of the sales aggregation in the
sales_per_month date histogram.
And the following may be the response:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
}
}
]
},
"sum_monthly_sales": {
"value": 985
}
}
}
50.5. Stats Bucket Aggregation
This functionality is experimental.
A sibling pipeline aggregation which calculates a variety of stats across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
50.5.1. Syntax
A stats_bucket aggregation looks like this in isolation:
{
"stats_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to calculate stats for (see buckets_path Syntax) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional, defaults to skip |
format | format to apply to the output value of this aggregation | Optional, defaults to null |
The following snippet calculates stats for the total monthly sales buckets:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"stats_monthly_sales": {
"stats_bucket": {
"buckets_paths": "sales_per_month>sales"
}
}
}
}
buckets_path instructs this stats_bucket aggregation that we want to calculate stats for the sales aggregation in the
sales_per_month date histogram.
And the following may be the response:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
}
}
]
},
"stats_monthly_sales": {
"count": 3,
"min": 60,
"max": 550,
"avg": 328.333333333,
"sum": 985
}
}
}
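The stats_monthly_sales numbers follow directly from the three bucket values:

```python
values = [550, 60, 375]
assert len(values) == 3          # count
assert min(values) == 60         # min
assert max(values) == 550        # max
assert sum(values) == 985        # sum
assert abs(sum(values) / len(values) - 328.333333333) < 1e-6  # avg
```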
50.6. Extended Stats Bucket Aggregation
This functionality is experimental.
A sibling pipeline aggregation which calculates a variety of stats across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
This aggregation provides a few more statistics (sum of squares, standard deviation, etc) compared to the stats_bucket aggregation.
50.6.1. Syntax
A extended_stats_bucket aggregation looks like this in isolation:
{
"extended_stats_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to calculate stats for (see buckets_path Syntax) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional, defaults to skip |
format | format to apply to the output value of this aggregation | Optional, defaults to null |
sigma | The number of standard deviations above/below the mean to display | Optional, defaults to 2 |
The following snippet calculates extended stats for the total monthly sales buckets:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"stats_monthly_sales": {
"extended_stats_bucket": {
"buckets_paths": "sales_per_month>sales"
}
}
}
}
buckets_path instructs this extended_stats_bucket aggregation that we want to calculate stats for the sales aggregation in the
sales_per_month date histogram.
And the following may be the response:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
}
}
]
},
"stats_monthly_sales": {
"count": 3,
"min": 60,
"max": 550,
"avg": 328.333333333,
"sum": 985,
"sum_of_squares": 446725,
"variance": 41105.5555556,
"std_deviation": 117.054909559,
"std_deviation_bounds": {
"upper": 562.443152451,
"lower": 94.2235142151
}
}
}
}
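Some of the extra statistics can be checked by hand; here sum_of_squares and the population variance, computed as E[x²] − E[x]²:

```python
values = [550, 60, 375]
mean = sum(values) / len(values)
sum_of_squares = sum(v * v for v in values)
# Population variance: mean of squares minus square of mean.
variance = sum_of_squares / len(values) - mean * mean
assert sum_of_squares == 446725
assert abs(variance - 41105.5555556) < 1e-4
```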
50.7. Percentiles Bucket Aggregation
This functionality is experimental.
A sibling pipeline aggregation which calculates percentiles across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
50.7.1. Syntax
A percentiles_bucket aggregation looks like this in isolation:
{
"percentiles_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to calculate percentiles for (see buckets_path Syntax) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional, defaults to skip |
format | format to apply to the output value of this aggregation | Optional, defaults to null |
percents | The list of percentiles to calculate | Optional, defaults to [1, 5, 25, 50, 75, 95, 99] |
The following snippet calculates percentiles for the total monthly sales buckets:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"sum_monthly_sales": {
"percentiles_bucket": {
"buckets_paths": "sales_per_month>sales",
"percents": [ 25.0, 50.0, 75.0 ]
}
}
}
}
buckets_path instructs this percentiles_bucket aggregation that we want to calculate percentiles for
the sales aggregation in the sales_per_month date histogram.
percents specifies which percentiles we wish to calculate, in this case, the 25th, 50th and 75th percentiles.
And the following may be the response:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
}
}
]
},
"percentiles_monthly_sales": {
"values" : {
"25.0": 60,
"50.0": 375",
"75.0": 550
}
}
}
}
50.7.2. Percentiles_bucket implementation
The Percentile Bucket returns the nearest input data point that is not greater than the requested percentile; it does not interpolate between data points.
The percentiles are calculated exactly and are not an approximation (unlike the Percentiles Metric). This means
the implementation maintains an in-memory, sorted list of your data to compute the percentiles, before discarding the
data. You may run into memory pressure issues if you attempt to calculate percentiles over many millions of
data-points in a single percentiles_bucket.
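The values in the example response are consistent with a nearest-rank computation over the sorted bucket values. A plausible sketch (the exact rounding Elasticsearch uses may differ):

```python
def percentile_bucket(values, percent):
    ordered = sorted(values)
    # Nearest rank: scale the percentile to an index and clamp it.
    index = min(int(percent / 100.0 * len(ordered)), len(ordered) - 1)
    return ordered[index]

sales = [550, 60, 375]
assert [percentile_bucket(sales, p) for p in (25.0, 50.0, 75.0)] == [60, 375, 550]
```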
50.8. Moving Average Aggregation
This functionality is experimental.
Given an ordered series of data, the Moving Average aggregation will slide a window across the data and emit the average
value of that window. For example, given the data [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], we can calculate a simple moving
average with windows size of 5 as follows:
-
(1 + 2 + 3 + 4 + 5) / 5 = 3
-
(2 + 3 + 4 + 5 + 6) / 5 = 4
-
(3 + 4 + 5 + 6 + 7) / 5 = 5
-
etc
Moving averages are a simple method to smooth sequential data. Moving averages are typically applied to time-based data, such as stock prices or server metrics. The smoothing can be used to eliminate high frequency fluctuations or random noise, which allows the lower frequency trends to be more easily visualized, such as seasonality.
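The windowed averages listed above can be computed with a straightforward sketch:

```python
def simple_moving_avg(values, window):
    # One average per full window position.
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
assert simple_moving_avg(data, 5) == [3, 4, 5, 6, 7, 8]
```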
50.8.1. Syntax
A moving_avg aggregation looks like this in isolation:
{
"moving_avg": {
"buckets_path": "the_sum",
"model": "holt",
"window": 5,
"gap_policy": "insert_zero",
"settings": {
"alpha": 0.8
}
}
}
Parameter Name |
Description |
Required |
Default Value |
|
Path to the metric of interest (see |
Required |
|
|
The moving average weighting model that we wish to use |
Optional |
|
|
Determines what should happen when a gap in the data is encountered. |
Optional |
|
|
The size of window to "slide" across the histogram. |
Optional |
|
|
If the model should be algorithmically minimized. See Minimization for more details |
Optional |
|
|
Model-specific settings, contents which differ depending on the model specified. |
Optional |
moving_avg aggregations must be embedded inside of a histogram or date_histogram aggregation. They can be
embedded like any other metric aggregation:
{
"my_date_histo":{
"date_histogram":{
"field":"timestamp",
"interval":"day"
},
"aggs":{
"the_sum":{
"sum":{ "field": "lemmings" }
},
"the_movavg":{
"moving_avg":{ "buckets_path": "the_sum" }
}
}
}
}
A date_histogram named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals |
|
A sum metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc) |
|
Finally, we specify a moving_avg aggregation which uses "the_sum" metric as its input. |
Moving averages are built by first specifying a histogram or date_histogram over a field. You can then optionally
add normal metrics, such as a sum, inside of that histogram. Finally, the moving_avg is embedded inside the histogram.
The buckets_path parameter is then used to "point" at one of the sibling metrics inside of the histogram (see
buckets_path Syntax for a description of the syntax for buckets_path).
50.8.2. Models
The moving_avg aggregation includes five different moving average "models". The main difference is how the values in the
window are weighted. As data-points become "older" in the window, they may be weighted differently. This will
affect the final average for that window.
Models are specified using the model parameter. Some models may have optional configurations which are specified inside
the settings parameter.
Simple
The simple model calculates the sum of all values in the window, then divides by the size of the window. It is effectively
a simple arithmetic mean of the window. The simple model does not perform any time-dependent weighting, which means
the values from a simple moving average tend to "lag" behind the real data.
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "simple"
}
}
}
A simple model has no special settings to configure
The window size can change the behavior of the moving average. For example, a small window ("window": 10) will closely
track the data and only smooth out small scale fluctuations:
In contrast, a simple moving average with larger window ("window": 100) will smooth out all higher-frequency fluctuations,
leaving only low-frequency, long term trends. It also tends to "lag" behind the actual data by a substantial amount:
50.8.3. Linear
The linear model assigns a linear weighting to points in the series, such that "older" datapoints (e.g. those at
the beginning of the window) contribute a linearly less amount to the total average. The linear weighting helps reduce
the "lag" behind the data’s mean, since older points have less influence.
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "linear"
}
}
}
A linear model has no special settings to configure
Like the simple model, window size can change the behavior of the moving average. For example, a small window ("window": 10)
will closely track the data and only smooth out small scale fluctuations:
In contrast, a linear moving average with larger window ("window": 100) will smooth out all higher-frequency fluctuations,
leaving only low-frequency, long term trends. It also tends to "lag" behind the actual data by a substantial amount,
although typically less than the simple model:
50.8.4. EWMA (Exponentially Weighted)
The ewma model (aka "single-exponential") is similar to the linear model, except older data-points become exponentially less important,
rather than linearly less important. The speed at which the importance decays can be controlled with an alpha
setting. Small values make the weight decay slowly, which provides greater smoothing and takes into account a larger
portion of the window. Larger values make the weight decay quickly, which reduces the impact of older values on the
moving average. This tends to make the moving average track the data more closely but with less smoothing.
The default value of alpha is 0.3, and the setting accepts any float from 0-1 inclusive.
The EWMA model can be Minimized
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "ewma",
"settings" : {
"alpha" : 0.5
}
}
}
}
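The exponential decay controlled by alpha can be sketched as the classic single-exponential recurrence. The seeding choice here (start from the first value) is illustrative; the actual implementation may seed differently:

```python
def ewma(window_values, alpha=0.3):
    """Exponentially weighted moving average over one window.

    The newest point gets weight `alpha`; the previously smoothed state
    decays by a factor of (1 - alpha) at each step.
    """
    s = window_values[0]
    for x in window_values[1:]:
        s = alpha * x + (1 - alpha) * s
    return s

print(ewma([1, 2, 3], alpha=0.5))  # -> 2.25
```

With a small alpha the result stays close to the older values; with alpha near 1 it tracks the newest point almost exactly.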
50.8.5. Holt-Linear
The holt model (aka "double exponential") incorporates a second exponential term which
tracks the data’s trend. Single exponential does not perform well when the data has an underlying linear trend. The
double exponential model calculates two values internally: a "level" and a "trend".
The level calculation is similar to ewma, and is an exponentially weighted view of the data. The difference is
that the previously smoothed value is used instead of the raw value, which allows it to stay close to the original series.
The trend calculation looks at the difference between the current and last value (e.g. the slope, or trend, of the
smoothed data). The trend value is also exponentially weighted.
Values are produced by multiplying the level and trend components.
The default value of alpha is 0.3 and beta is 0.1. The settings accept any float from 0-1 inclusive.
The Holt-Linear model can be Minimized
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "holt",
"settings" : {
"alpha" : 0.5,
"beta" : 0.5
}
}
}
}
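The level and trend recurrences described above can be sketched as follows. The seeding of the initial level and trend is an illustrative choice, not necessarily what the server does:

```python
def holt(values, alpha=0.5, beta=0.5):
    """Double-exponential (Holt) smoothing: track a level and a trend.

    The level is an exponentially weighted view of the data that uses the
    previously smoothed state; the trend is an exponentially weighted view
    of the step-to-step slope. Returns the final (level, trend) pair.
    """
    level, trend = values[0], values[1] - values[0]
    for x in values[1:]:
        last_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
    return level, trend

# For a perfectly linear series, the level tracks the last value and the
# trend converges on the slope:
print(holt([1, 2, 3, 4, 5]))  # -> (5.0, 1.0)
```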
In practice, the alpha value behaves very similarly in holt as ewma: small values produce more smoothing
and more lag, while larger values produce closer tracking and less lag. The value of beta is often difficult
to see. Small values emphasize long-term trends (such as a constant linear trend in the whole series), while larger
values emphasize short-term trends. This will become more apparent when you are predicting values.
50.8.6. Holt-Winters
The holt_winters model (aka "triple exponential") incorporates a third exponential term which
tracks the seasonal aspect of your data. This aggregation therefore smooths based on three components: "level", "trend"
and "seasonality".
The level and trend calculation is identical to holt. The seasonal calculation looks at the difference between
the current point, and the point one period earlier.
Holt-Winters requires a little more handholding than the other moving averages. You need to specify the "periodicity"
of your data: e.g. if your data has cyclic trends every 7 days, you would set period: 7. Similarly if there was
a monthly trend, you would set it to 30. There is currently no periodicity detection, although that is planned
for future enhancements.
There are two varieties of Holt-Winters: additive and multiplicative.
"Cold Start"
Unfortunately, due to the nature of Holt-Winters, it requires two periods of data to "bootstrap" the algorithm. This
means that your window must always be at least twice the size of your period. An exception will be thrown if it
isn’t. It also means that Holt-Winters will not emit a value for the first 2 * period buckets; the current algorithm
does not backcast.
Because the "cold start" obscures what the moving average looks like, the rest of the Holt-Winters images are truncated to not show the "cold start". Just be aware this will always be present at the beginning of your moving averages!
Additive Holt-Winters
Additive seasonality is the default; it can also be specified by setting "type": "add". This variety is preferred
when the seasonal effect is additive to your data. E.g. you could simply subtract the seasonal effect to "de-seasonalize"
your data into a flat trend.
The default values of alpha and gamma are 0.3 while beta is 0.1. The settings accept any float from 0-1 inclusive.
The default value of period is 1.
The additive Holt-Winters model can be Minimized
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "holt_winters",
"settings" : {
"type" : "add",
"alpha" : 0.5,
"beta" : 0.5,
"gamma" : 0.5,
"period" : 7
}
}
}
}
Multiplicative Holt-Winters
Multiplicative is specified by setting "type": "mult". This variety is preferred when the seasonal effect is
multiplied against your data. E.g. if the seasonal effect is a 5x multiplier of the data, rather than simply adding to it.
The default values of alpha and gamma are 0.3 while beta is 0.1. The settings accept any float from 0-1 inclusive.
The default value of period is 1.
The multiplicative Holt-Winters model can be Minimized
|
|
Multiplicative Holt-Winters works by dividing each data point by the seasonal value. This is problematic if any of
your data is zero, or if there are gaps in the data (since this results in a divide-by-zero). To combat this, the
pad setting can be enabled (as in the example below), which adds a very small amount to all values so that they are never zero. |
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "holt_winters",
"settings" : {
"type" : "mult",
"alpha" : 0.5,
"beta" : 0.5,
"gamma" : 0.5,
"period" : 7,
"pad" : true
}
}
}
}
50.8.7. Prediction
All the moving average models support a "prediction" mode, which will attempt to extrapolate into the future given the current smoothed, moving average. Depending on the model and parameters, these predictions may or may not be accurate.
Predictions are enabled by adding a predict parameter to any moving average aggregation, specifying the number of
predictions you would like appended to the end of the series. These predictions will be spaced out at the same interval
as your buckets:
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "simple",
"predict" : 10
}
}
}
The simple, linear and ewma models all produce "flat" predictions: they essentially converge on the mean
of the last value in the series, producing a flat line:
In contrast, the holt model can extrapolate based on local or global constant trends. If we set a high beta
value, we can extrapolate based on local constant trends (in this case the predictions head down, because the data at the end
of the series was heading in a downward direction):
In contrast, if we choose a small beta, the predictions are based on the global constant trend. In this series, the
global trend is slightly positive, so the prediction makes a sharp u-turn and begins a positive slope:
The holt_winters model has the potential to deliver the best predictions, since it also incorporates seasonal
fluctuations into the model:
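The trend-aware extrapolation performed by the holt model can be sketched with the standard Holt forecast formula, where the h-step-ahead prediction is the level plus h times the trend (an illustrative sketch; the function name and inputs are hypothetical):

```python
def holt_predict(level, trend, n):
    """Extrapolate n future values from Holt's level and trend components.

    The h-step-ahead forecast is level + h * trend, so a non-zero trend
    produces a sloped prediction instead of a flat line.
    """
    return [level + h * trend for h in range(1, n + 1)]

# A series ending at level 100 with a local downward trend of -3 per bucket:
print(holt_predict(100.0, -3.0, 5))
# -> [97.0, 94.0, 91.0, 88.0, 85.0]
```

This is why a high beta (which weights the local trend) lets the predictions continue a recent downward slope, while a low beta extrapolates the global trend instead.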
50.8.8. Minimization
Some of the models (EWMA, Holt-Linear, Holt-Winters) require one or more parameters to be configured. Parameter choice can be tricky and sometimes non-intuitive. Furthermore, small deviations in these parameters can sometimes have a drastic effect on the output moving average.
For that reason, the three "tunable" models can be algorithmically minimized. Minimization is a process where parameters are tweaked until the predictions generated by the model closely match the output data. Minimization is not foolproof and can be susceptible to overfitting, but it often gives better results than hand-tuning.
Minimization is disabled by default for ewma and holt_linear, while it is enabled by default for holt_winters.
Minimization is most useful with Holt-Winters, since it helps improve the accuracy of the predictions. EWMA and
Holt-Linear are not great predictors, and mostly used for smoothing data, so minimization is less useful on those
models.
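The idea behind minimization, choosing the parameters that best reproduce the observed data, can be sketched with a brute-force search over alpha for single-exponential smoothing. This is only an analogue: the actual implementation uses simulated annealing, and the helper names here are hypothetical:

```python
def one_step_errors(data, alpha):
    """Sum of squared one-step-ahead prediction errors for
    single-exponential smoothing with a given alpha."""
    s, err = data[0], 0.0
    for x in data[1:]:
        err += (x - s) ** 2          # predict the next point with the current state
        s = alpha * x + (1 - alpha) * s
    return err

def minimize_alpha(data, steps=100):
    """Pick the alpha in [0, 1] with the lowest squared prediction error,
    a brute-force stand-in for the stochastic search ES actually runs."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda a: one_step_errors(data, a))

# For a steadily rising series, close tracking (alpha = 1.0) wins:
print(minimize_alpha([1, 2, 3, 4, 5, 6, 7, 8]))  # -> 1.0
```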
Minimization is enabled/disabled via the minimize parameter:
{
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"model" : "holt_winters",
"window" : 30,
"minimize" : true,
"settings" : {
"period" : 7
}
}
}
}
Minimization is enabled with the minimize parameter |
When enabled, minimization will find the optimal values for alpha, beta and gamma. The user should still provide
appropriate values for window, period and type.
|
|
Minimization works by running a stochastic process called simulated annealing. This process will usually generate a good solution, but is not guaranteed to find the global optimum. It also requires some amount of additional computational power, since the model needs to be re-run multiple times as the values are tweaked. The run-time of minimization is linear to the size of the window being processed: excessively large windows may cause latency. Finally, minimization fits the model to the last |
50.9. Cumulative Sum Aggregation
experimental[]
A parent pipeline aggregation which calculates the cumulative sum of a specified metric in a parent histogram (or date_histogram)
aggregation. The specified metric must be numeric and the enclosing histogram must have min_doc_count set to 0 (default
for histogram aggregations).
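The semantics are a simple running total over the ordered buckets, which can be sketched with Python's standard library:

```python
from itertools import accumulate

# Monthly sales 550, 60, 375 become running totals 550, 610, 985,
# matching the example response below.
monthly_sales = [550, 60, 375]
print(list(accumulate(monthly_sales)))
# -> [550, 610, 985]
```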
50.9.1. Syntax
A cumulative_sum aggregation looks like this in isolation:
{
"cumulative_sum": {
"buckets_path": "the_sum"
}
}
Parameter Name |
Description |
Required |
Default Value |
|
The path to the buckets we wish to find the cumulative sum for (see |
Required |
|
|
format to apply to the output value of this aggregation |
Optional, defaults to |
The following snippet calculates the cumulative sum of the total monthly sales:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
},
"cumulative_sales": {
"cumulative_sum": {
"buckets_path": "sales"
}
}
}
}
}
}
buckets_path instructs this cumulative sum aggregation to use the output of the sales aggregation for the cumulative sum |
And the following may be the response:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550
},
"cumulative_sales": {
"value": 550
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60
},
"cumulative_sales": {
"value": 610
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375
},
"cumulative_sales": {
"value": 985
}
}
]
}
}
}
50.10. Bucket Script Aggregation
experimental[]
A parent pipeline aggregation which executes a script which can perform per bucket computations on specified metrics in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a numeric value.
50.10.1. Syntax
A bucket_script aggregation looks like this in isolation:
{
"bucket_script": {
"buckets_path": {
"my_var1": "the_sum",
"my_var2": "the_value_count"
},
"script": "my_var1 / my_var2"
}
}
Here, my_var1 is the name of the variable for this buckets path to use in the script, the_sum is the path to
the metrics to use for that variable. |
Parameter Name |
Description |
Required |
Default Value |
|
The script to run for this aggregation. The script can be inline, file or indexed. (see Scripting for more details) |
Required |
|
|
A map of script variables and their associated path to the buckets we wish to use for the variable
(see |
Required |
|
|
The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) |
Optional, defaults to |
|
|
format to apply to the output value of this aggregation |
Optional, defaults to |
The following snippet calculates the ratio percentage of t-shirt sales compared to total sales each month:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"total_sales": {
"sum": {
"field": "price"
}
},
"t-shirts": {
"filter": {
"term": {
"type": "t-shirt"
}
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"t-shirt-percentage": {
"bucket_script": {
"buckets_path": {
"tShirtSales": "t-shirts>sales",
"totalSales": "total_sales"
},
"script": "tShirtSales / totalSales * 100"
}
}
}
}
}
}
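The per-bucket computation the script performs is straightforward to model in Python. The dictionary keys below are illustrative stand-ins for the resolved buckets_path values:

```python
def tshirt_percentage(buckets):
    """Per-bucket computation equivalent to the bucket_script request:
    tShirtSales / totalSales * 100 for each monthly bucket."""
    return [b["tshirt_sales"] / b["total_sales"] * 100 for b in buckets]

# Values from the example response: totals 50, 60, 40; t-shirt sales 10, 15, 20
months = [
    {"total_sales": 50, "tshirt_sales": 10},
    {"total_sales": 60, "tshirt_sales": 15},
    {"total_sales": 40, "tshirt_sales": 20},
]
print(tshirt_percentage(months))
# -> [20.0, 25.0, 50.0]
```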
And the following may be the response:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"total_sales": {
"value": 50
},
"t-shirts": {
"doc_count": 2,
"sales": {
"value": 10
}
},
"t-shirt-percentage": {
"value": 20
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"total_sales": {
"value": 60
},
"t-shirts": {
"doc_count": 1,
"sales": {
"value": 15
}
},
"t-shirt-percentage": {
"value": 25
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"total_sales": {
"value": 40
},
"t-shirts": {
"doc_count": 1,
"sales": {
"value": 20
}
},
"t-shirt-percentage": {
"value": 50
}
}
]
}
}
}
50.11. Bucket Selector Aggregation
experimental[]
A parent pipeline aggregation which executes a script which determines whether the current bucket will be retained
in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a boolean value.
If the script language is expression then a numeric return value is permitted. In this case 0.0 will be evaluated as false
and all other values will evaluate to true.
Note: The bucket_selector aggregation, like all pipeline aggregations, executes after all other sibling aggregations. This means that using the bucket_selector aggregation to filter the returned buckets in the response does not save on execution time running the aggregations.
50.11.1. Syntax
A bucket_selector aggregation looks like this in isolation:
{
"bucket_selector": {
"buckets_path": {
"my_var1": "the_sum",
"my_var2": "the_value_count"
},
"script": "my_var1 > my_var2"
}
}
Here, my_var1 is the name of the variable for this buckets path to use in the script, the_sum is the path to
the metrics to use for that variable. |
Parameter Name |
Description |
Required |
Default Value |
|
The script to run for this aggregation. The script can be inline, file or indexed. (see Scripting for more details) |
Required |
|
|
A map of script variables and their associated path to the buckets we wish to use for the variable
(see |
Required |
|
|
The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) |
Optional, defaults to |
The following snippet only retains buckets where the total sales for the month is less than or equal to 50:
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"total_sales": {
"sum": {
"field": "price"
}
},
"sales_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"totalSales": "total_sales"
},
"script": "totalSales <= 50"
}
}
}
}
}
}
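The filtering the script performs can be modeled as a simple predicate over the buckets. The dictionary keys are illustrative stand-ins for the resolved buckets_path values:

```python
def select_buckets(buckets, limit=50):
    """Keep only buckets whose total sales satisfy the script condition
    totalSales <= limit; the rest are dropped from the response."""
    return [b for b in buckets if b["total_sales"] <= limit]

# Monthly totals from the example: 50, 60, 40. February (60) is dropped.
months = [
    {"month": "2015/01", "total_sales": 50},
    {"month": "2015/02", "total_sales": 60},
    {"month": "2015/03", "total_sales": 40},
]
print([b["month"] for b in select_buckets(months)])
# -> ['2015/01', '2015/03']
```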
And the following may be the response:
{
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"total_sales": {
"value": 50
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"total_sales": {
"value": 40
}
}
]
}
}
}
Bucket for 2015/02/01 00:00:00 has been removed as its total sales exceeded 50
50.12. Serial Differencing Aggregation
experimental[]
Serial differencing is a technique where values in a time series are subtracted from itself at different time lags or periods. For example, the datapoint f(x) = f(x_t) - f(x_{t-n}), where n is the period being used.
A period of 1 is equivalent to a derivative with no time normalization: it is simply the change from one point to the next. Single periods are useful for removing constant, linear trends.
Single periods are also useful for transforming data into a stationary series. In this example, the Dow Jones is plotted over ~250 days. The raw data is not stationary, which would make it difficult to use with some techniques.
By calculating the first-difference, we de-trend the data (e.g. remove a constant, linear trend). We can see that the data becomes a stationary series (e.g. the first difference is randomly distributed around zero, and doesn’t seem to exhibit any pattern/behavior). The transformation reveals that the dataset is following a random-walk; the value is the previous value +/- a random amount. This insight allows selection of further tools for analysis.
Larger periods can be used to remove seasonal / cyclic behavior. In this example, a population of lemmings was synthetically generated with a sine wave + constant linear trend + random noise. The sine wave has a period of 30 days.
The first-difference removes the constant trend, leaving just a sine wave. The 30th-difference is then applied to the first-difference to remove the cyclic behavior, leaving a stationary series which is amenable to other analysis.
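The differencing operation itself is simple to sketch: each value has the value `lag` positions earlier subtracted from it, and the first `lag` positions (which have no predecessor) produce no output:

```python
def serial_diff(values, lag=1):
    """Subtract the value `lag` buckets earlier from each value.

    The first `lag` positions have no earlier counterpart and are skipped,
    so the output is shorter than the input by `lag` elements.
    """
    return [values[i] - values[i - lag] for i in range(lag, len(values))]

print(serial_diff([1, 2, 4, 7, 11]))          # first difference -> [1, 2, 3, 4]
print(serial_diff([1, 2, 4, 7, 11], lag=2))   # -> [3, 5, 7]
```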
50.12.1. Syntax
A serial_diff aggregation looks like this in isolation:
{
"serial_diff": {
"buckets_path": "the_sum",
"lag": "7"
}
}
Parameter Name |
Description |
Required |
Default Value |
|
Path to the metric of interest (see |
Required |
|
|
The historical bucket to subtract from the current value. E.g. a lag of 7 will subtract the current value from the value 7 buckets ago. Must be a positive, non-zero integer |
Optional |
|
|
Determines what should happen when a gap in the data is encountered. |
Optional |
|
|
Format to apply to the output value of this aggregation |
Optional |
|
serial_diff aggregations must be embedded inside of a histogram or date_histogram aggregation:
{
"aggs": {
"my_date_histo": {
"date_histogram": {
"field": "timestamp",
"interval": "day"
},
"aggs": {
"the_sum": {
"sum": {
"field": "lemmings"
}
},
"thirtieth_difference": {
"serial_diff": {
"buckets_path": "the_sum",
"lag" : 30
}
}
}
}
}
}
A date_histogram named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals |
|
A sum metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc) |
|
Finally, we specify a serial_diff aggregation which uses "the_sum" metric as its input. |
Serial differences are built by first specifying a histogram or date_histogram over a field. You can then optionally
add normal metrics, such as a sum, inside of that histogram. Finally, the serial_diff is embedded inside the histogram.
The buckets_path parameter is then used to "point" at one of the sibling metrics inside of the histogram (see
buckets_path Syntax for a description of the syntax for buckets_path.
51. Caching heavy aggregations
Frequently used aggregations (e.g. for display on the home page of a website) can be cached for faster responses. These cached results are the same results that would be returned by an uncached aggregation — you will never get stale results.
See Shard request cache for more details.
52. Returning only aggregation results
There are many occasions when aggregations are required but search hits are not. For these cases the hits can be ignored by
setting size=0. For example:
$ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
"size": 0,
"aggregations": {
"my_agg": {
"terms": {
"field": "text"
}
}
}
}
'
Setting size to 0 avoids executing the fetch phase of the search, making the request more efficient.
53. Aggregation Metadata
You can associate a piece of metadata with individual aggregations at request time that will be returned in place at response time.
Consider this example where we want to associate the color blue with our terms aggregation.
{
...
"aggs": {
"titles": {
"terms": {
"field": "title"
},
"meta": {
"color": "blue"
}
}
}
}
Then that piece of metadata will be returned in place for our titles terms aggregation
{
...
"aggregations": {
"titles": {
"meta": {
"color" : "blue"
},
"buckets": [
]
}
}
}
Indices APIs
The indices APIs are used to manage individual indices, index settings, aliases, mappings, index templates and warmers.
Index management:
Mapping management:
Alias management:
Index settings:
Replica configurations
Monitoring:
Status management:
54. Create Index
The create index API allows to instantiate an index. Elasticsearch provides support for multiple indices, including executing operations across several indices.
Index Settings
Each index created can have specific settings associated with it.
$ curl -XPUT 'http://localhost:9200/twitter/'
$ curl -XPUT 'http://localhost:9200/twitter/' -d '
index :
number_of_shards : 3
number_of_replicas : 2
'
Default for number_of_shards is 5 |
|
Default for number_of_replicas is 1 (i.e. one replica for each primary shard) |
The above second curl example shows how an index called twitter can be
created with specific settings for it using YAML.
In this case, creating an index with 3 shards, each with 2 replicas. The
index settings can also be defined with JSON:
$ curl -XPUT 'http://localhost:9200/twitter/' -d '{
"settings" : {
"index" : {
"number_of_shards" : 3,
"number_of_replicas" : 2
}
}
}'
or more simplified
$ curl -XPUT 'http://localhost:9200/twitter/' -d '{
"settings" : {
"number_of_shards" : 3,
"number_of_replicas" : 2
}
}'
|
|
You do not have to explicitly specify index section inside the
settings section.
|
For more information regarding all the different index level settings that can be set when creating an index, please check the index modules section.
Mappings
The create index API allows to provide a set of one or more mappings:
curl -XPOST localhost:9200/test -d '{
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"type1" : {
"properties" : {
"field1" : { "type" : "string", "index" : "not_analyzed" }
}
}
}
}'
Warmers
The create index API allows also to provide a set of warmers:
curl -XPUT localhost:9200/test -d '{
"warmers" : {
"warmer_1" : {
"source" : {
"query" : {
...
}
}
}
}
}'
Aliases
The create index API allows also to provide a set of aliases:
curl -XPUT localhost:9200/test -d '{
"aliases" : {
"alias_1" : {},
"alias_2" : {
"filter" : {
"term" : {"user" : "kimchy" }
},
"routing" : "kimchy"
}
}
}'
Creation Date
When an index is created, a timestamp is stored in the index metadata for the creation date. By
default this is automatically generated but it can also be specified using the
creation_date parameter on the create index API:
curl -XPUT localhost:9200/test -d '{
"creation_date" : 1407751337000
}'
creation_date is set using epoch time in milliseconds. |
55. Delete Index
The delete index API allows to delete an existing index.
$ curl -XDELETE 'http://localhost:9200/twitter/'
The above example deletes an index called twitter. Specifying an index,
alias or wildcard expression is required.
The delete index API can also be applied to more than one index, by either using a comma separated list, or on all indices (be careful!) by using _all or * as index.
In order to disable allowing to delete indices via wildcards or _all,
set action.destructive_requires_name setting in the config to true.
This setting can also be changed via the cluster update settings api.
56. Get Index
The get index API allows to retrieve information about one or more indices.
$ curl -XGET 'http://localhost:9200/twitter/'
The above example gets the information for an index called twitter. Specifying an index,
alias or wildcard expression is required.
The get index API can also be applied to more than one index, or on
all indices by using _all or * as index.
Filtering index information
The information returned by the get API can be filtered to include only specific features by specifying a comma delimited list of features in the URL:
$ curl -XGET 'http://localhost:9200/twitter/_settings,_mappings'
The above command will only return the settings and mappings for the index called twitter.
The available features are _settings, _mappings, _warmers and _aliases.
57. Indices Exists
Used to check if the index (indices) exists or not. For example:
curl -XHEAD -i 'http://localhost:9200/twitter'
The HTTP status code indicates if the index exists or not. A 404 means
it does not exist, and 200 means it does.
58. Open / Close Index API
The open and close index APIs allow to close an index, and later on opening it. A closed index has almost no overhead on the cluster (except for maintaining its metadata), and is blocked for read/write operations. A closed index can be opened which will then go through the normal recovery process.
The REST endpoint is /{index}/_close and /{index}/_open. For
example:
curl -XPOST 'localhost:9200/my_index/_close'
curl -XPOST 'localhost:9200/my_index/_open'
It is possible to open and close multiple indices. An error will be thrown
if the request explicitly refers to a missing index. This behaviour can be
disabled using the ignore_unavailable=true parameter.
All indices can be opened or closed at once using _all as the index name
or specifying patterns that identify them all (e.g. *).
Identifying indices via wildcards or _all can be disabled by setting the
action.destructive_requires_name flag in the config file to true.
This setting can also be changed via the cluster update settings api.
Closed indices consume a significant amount of disk-space which can cause problems
in managed environments. Closing indices can be disabled via the cluster settings
API by setting cluster.indices.close.enable to false. The default is true.
59. Put Mapping
The PUT mapping API allows you to add a new type to an existing index, or new fields to an existing type:
PUT twitter
{
"mappings": {
"tweet": {
"properties": {
"message": {
"type": "string"
}
}
}
}
}
PUT twitter/_mapping/user
{
"properties": {
"name": {
"type": "string"
}
}
}
PUT twitter/_mapping/tweet
{
"properties": {
"user_name": {
"type": "string"
}
}
}
Creates an index called twitter with the message field in the tweet mapping type. |
|
Uses the PUT mapping API to add a new mapping type called user. |
|
Uses the PUT mapping API to add a new field called user_name to the tweet mapping type. |
More information on how to define type mappings can be found in the mapping section.
Multi-index
The PUT mapping API can be applied to multiple indices with a single request. It has the following format:
PUT /{index}/_mapping/{type}
{ body }
-
{index} accepts multiple index names and wildcards.
-
{type} is the name of the type to update.
-
{body} contains the mapping changes that should be applied.
Updating field mappings
In general, the mapping for existing fields cannot be updated. There are some exceptions to this rule. For instance:
-
new properties can be added to Object datatype fields.
-
new multi-fields can be added to existing fields.
-
doc_values can be disabled, but not enabled.
-
the ignore_above parameter can be updated.
For example:
PUT my_index
{
"mappings": {
"user": {
"properties": {
"name": {
"properties": {
"first": {
"type": "string"
}
}
},
"user_id": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
PUT my_index/_mapping/user
{
"properties": {
"name": {
"properties": {
"last": {
"type": "string"
}
}
},
"user_id": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 100
}
}
}
Create an index with a first field under the name Object datatype field, and a user_id field. |
|
Add a last field under the name object field. |
|
Update the ignore_above setting from its default of 0. |
Each mapping parameter specifies whether or not its setting can be updated on an existing field.
Conflicts between fields in different types
Fields in the same index with the same name in two different types must have
the same mapping, as they are backed by the same field internally. Trying to
update a mapping parameter for a field which
exists in more than one type will throw an exception, unless you specify the
update_all_types parameter, in which case it will update that parameter
across all fields with the same name in the same index.
|
|
The only parameters which are exempt from this rule — they can be set to different values on each field — can be found in Fields are shared across mapping types. |
For example:
PUT my_index
{
"mappings": {
"type_one": {
"properties": {
"text": {
"type": "string",
"analyzer": "standard"
}
}
},
"type_two": {
"properties": {
"text": {
"type": "string",
"analyzer": "standard"
}
}
}
}
}
PUT my_index/_mapping/type_one
{
"properties": {
"text": {
"type": "string",
"analyzer": "standard",
"search_analyzer": "whitespace"
}
}
}
PUT my_index/_mapping/type_one?update_all_types
{
"properties": {
"text": {
"type": "string",
"analyzer": "standard",
"search_analyzer": "whitespace"
}
}
}
Create an index with two types, both of which contain a text field with the same mapping.
Trying to update the search_analyzer just for type_one throws an exception like "Merge failed with failures...".
Adding the update_all_types parameter updates the text field in type_one and type_two.
60. Get Mapping
The get mapping API allows you to retrieve mapping definitions for an index or index/type.
curl -XGET 'http://localhost:9200/twitter/_mapping/tweet'
Multiple Indices and Types
The get mapping API can be used to get more than one index or type
mapping with a single call. General usage of the API follows the
following syntax: host:port/{index}/_mapping/{type} where both
{index} and {type} can accept a comma-separated list of names. To
get mappings for all indices you can use _all for {index}. The
following are some examples:
curl -XGET 'http://localhost:9200/_mapping/twitter,kimchy'
curl -XGET 'http://localhost:9200/_all/_mapping/tweet,book'
If you want to get mappings of all indices and types then the following two examples are equivalent:
curl -XGET 'http://localhost:9200/_all/_mapping'
curl -XGET 'http://localhost:9200/_mapping'
61. Get Field Mapping
The get field mapping API allows you to retrieve mapping definitions for one or more fields. This is useful when you do not need the complete type mapping returned by the Get Mapping API.
The following returns the mapping of the field text only:
curl -XGET 'http://localhost:9200/twitter/_mapping/tweet/field/text'
For which the response is (assuming text is a default string field):
{
"twitter": {
"tweet": {
"text": {
"full_name": "text",
"mapping": {
"text": { "type": "string" }
}
}
}
}
}
Multiple Indices, Types and Fields
The get field mapping API can be used to get the mapping of multiple fields from more than one index or type
with a single call. General usage of the API follows the
following syntax: host:port/{index}/{type}/_mapping/field/{field} where
{index}, {type} and {field} can stand for comma-separated list of names or wild cards. To
get mappings for all indices you can use _all for {index}. The
following are some examples:
curl -XGET 'http://localhost:9200/twitter,kimchy/_mapping/field/message'
curl -XGET 'http://localhost:9200/_all/_mapping/tweet,book/field/message,user.id'
curl -XGET 'http://localhost:9200/_all/_mapping/tw*/field/*.id'
Specifying fields
The get field mapping API allows you to specify one or more fields separated by a comma. You can also use wildcards. The field names can be any of the following:

Full names
the full path, including any parent object name the field is part of.

Field names
the name of the field without the path to it.
The above options are specified in the order the field parameter is resolved.
The first field found which matches is returned. This is especially important
if index names or field names are used as those can be ambiguous.
For example, consider the following mapping:
{
"article": {
"properties": {
"id": { "type": "string" },
"title": { "type": "string"},
"abstract": { "type": "string"},
"author": {
"properties": {
"id": { "type": "string" },
"name": { "type": "string" }
}
}
}
}
}
To select the id of the author field, you can use its full name author.id. Using name will return
the field author.name:
curl -XGET "http://localhost:9200/publications/_mapping/article/field/author.id,abstract,name"
returns:
{
"publications": {
"article": {
"abstract": {
"full_name": "abstract",
"mapping": {
"abstract": { "type": "string" }
}
},
"author.id": {
"full_name": "author.id",
"mapping": {
"id": { "type": "string" }
}
},
"name": {
"full_name": "author.name",
"mapping": {
"name": { "type": "string" }
}
}
}
}
}
Note how the response always uses the same fields specified in the request as keys.
The full_name in every entry contains the full name of the field whose mapping was returned.
This is useful when the request can refer to multiple fields.
Other options

include_defaults
Adding include_defaults=true to the query string will cause the response to include default values, which are normally suppressed.
62. Types Exists
Used to check if a type/types exists in an index/indices.
curl -XHEAD -i 'http://localhost:9200/twitter/tweet'
The HTTP status code indicates if the type exists or not. A 404 means
it does not exist, and 200 means it does.
63. Index Aliases
APIs in Elasticsearch accept an index name when working against a specific index, and several indices when applicable. The index aliases API allows you to alias an index with a name, with all APIs automatically converting the alias name to the actual index name. An alias can also be mapped to more than one index, and when specifying it, the alias will automatically expand to the aliased indices. An alias can also be associated with a filter that will automatically be applied when searching, and with routing values. An alias cannot have the same name as an index.
Here is a sample of associating the alias alias1 with index test1:
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
"actions" : [
{ "add" : { "index" : "test1", "alias" : "alias1" } }
]
}'
An alias can also be removed, for example:
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
"actions" : [
{ "remove" : { "index" : "test1", "alias" : "alias1" } }
]
}'
Renaming an alias is a simple remove then add operation within the
same API. This operation is atomic, no need to worry about a short
period of time where the alias does not point to an index:
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
"actions" : [
{ "remove" : { "index" : "test1", "alias" : "alias1" } },
{ "add" : { "index" : "test1", "alias" : "alias2" } }
]
}'
Associating an alias with more than one index is simply several add
actions:
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
"actions" : [
{ "add" : { "index" : "test1", "alias" : "alias1" } },
{ "add" : { "index" : "test2", "alias" : "alias1" } }
]
}'
Multiple indices can be specified for an action with the indices array syntax:
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
"actions" : [
{ "add" : { "indices" : ["test1", "test2"], "alias" : "alias1" } }
]
}'
To specify multiple aliases in one action, the corresponding aliases array
syntax exists as well.
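For instance, combining the plural forms in one action might look like the following sketch (the index name test1 is assumed to exist):

```shell
# Sketch: add two aliases to one index in a single action using the
# "aliases" array syntax (assumes an index named test1 already exists).
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
    "actions" : [
        { "add" : { "index" : "test1", "aliases" : ["alias1", "alias2"] } }
    ]
}'
```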
For the example above, a glob pattern can also be used to associate an alias to more than one index that share a common name:
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
"actions" : [
{ "add" : { "index" : "test*", "alias" : "all_test_indices" } }
]
}'
In this case, the alias is a point-in-time alias that will group all current indices that match; it will not automatically update as indices that match this pattern are added or removed.
It is an error to index to an alias which points to more than one index.
Filtered Aliases
Aliases with filters provide an easy way to create different "views" of the same index. The filter can be defined using Query DSL and is applied to all Search, Count, Delete By Query and More Like This operations with this alias.
To create a filtered alias, first we need to ensure that the fields already exist in the mapping:
curl -XPUT 'http://localhost:9200/test1' -d '{
"mappings": {
"type1": {
"properties": {
"user" : {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
Now we can create an alias that uses a filter on field user:
curl -XPOST 'http://localhost:9200/_aliases' -d '{
"actions" : [
{
"add" : {
"index" : "test1",
"alias" : "alias2",
"filter" : { "term" : { "user" : "kimchy" } }
}
}
]
}'
Routing
It is possible to associate routing values with aliases. This feature can be used together with filtering aliases in order to avoid unnecessary shard operations.
The following command creates a new alias alias1 that points to index
test. After alias1 is created, all operations with this alias are
automatically modified to use value 1 for routing:
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
"actions" : [
{
"add" : {
"index" : "test",
"alias" : "alias1",
"routing" : "1"
}
}
]
}'
It’s also possible to specify different routing values for searching and indexing operations:
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
"actions" : [
{
"add" : {
"index" : "test",
"alias" : "alias2",
"search_routing" : "1,2",
"index_routing" : "2"
}
}
]
}'
As shown in the example above, search routing may contain several values separated by comma. Index routing can contain only a single value.
If an operation that uses routing alias also has a routing parameter, an intersection of both alias routing and routing specified in the parameter is used. For example the following command will use "2" as a routing value:
curl -XGET 'http://localhost:9200/alias2/_search?q=user:kimchy&routing=2,3'
Add a single alias
An alias can also be added with the endpoint
PUT /{index}/_alias/{name}
where

index
The index the alias refers to. Can be any of * | _all | glob pattern | name1, name2, ….

name
The name of the alias. This is a required option.

routing
An optional routing that can be associated with an alias.

filter
An optional filter that can be associated with an alias.
You can also use the plural _aliases.
Examples:
- Adding a time based alias:

curl -XPUT 'localhost:9200/logs_201305/_alias/2013'

- Adding a user alias:

First create the index and add a mapping for the user_id field:

curl -XPUT 'localhost:9200/users' -d '{
    "mappings" : {
        "user" : {
            "properties" : {
                "user_id" : {"type" : "integer"}
            }
        }
    }
}'

Then add the alias for a specific user:

curl -XPUT 'localhost:9200/users/_alias/user_12' -d '{
    "routing" : "12",
    "filter" : {
        "term" : {
            "user_id" : 12
        }
    }
}'
Aliases during index creation
Aliases can also be specified during index creation:
curl -XPUT localhost:9200/logs_20142801 -d '{
"mappings" : {
"type" : {
"properties" : {
"year" : {"type" : "integer"}
}
}
},
"aliases" : {
"current_day" : {},
"2014" : {
"filter" : {
"term" : {"year" : 2014 }
}
}
}
}'
Delete aliases
The rest endpoint is: /{index}/_alias/{name}
where

{index}
* | _all | glob pattern | name1, name2, …

{name}
* | _all | glob pattern | name1, name2, …
Alternatively you can use the plural _aliases. Example:
curl -XDELETE 'localhost:9200/users/_alias/user_12'
Retrieving existing aliases
The get index alias API allows you to filter by alias name and index name. This API redirects to the master and fetches the requested index aliases, if available. It only serialises the found index aliases.
Possible options:
index
The index name to get aliases for. Partial names are supported via wildcards, and multiple index names can be specified separated by a comma. An alias name for an index can also be used.

alias
The name of the alias to return in the response. Like the index option, this option supports wildcards and the option to specify multiple alias names separated by a comma.

ignore_unavailable
What to do if a specified index name doesn't exist. If set to true then those indices are ignored.
The rest endpoint is: /{index}/_alias/{alias}.
Examples:
All aliases for the index users:
curl -XGET 'localhost:9200/users/_alias/*'
Response:
{
"users" : {
"aliases" : {
"user_13" : {
"filter" : {
"term" : {
"user_id" : 13
}
},
"index_routing" : "13",
"search_routing" : "13"
},
"user_14" : {
"filter" : {
"term" : {
"user_id" : 14
}
},
"index_routing" : "14",
"search_routing" : "14"
},
"user_12" : {
"filter" : {
"term" : {
"user_id" : 12
}
},
"index_routing" : "12",
"search_routing" : "12"
}
}
}
}
All aliases with the name 2013 in any index:
curl -XGET 'localhost:9200/_alias/2013'
Response:
{
"logs_201304" : {
"aliases" : {
"2013" : { }
}
},
"logs_201305" : {
"aliases" : {
"2013" : { }
}
}
}
All aliases that start with 2013_01 in any index:
curl -XGET 'localhost:9200/_alias/2013_01*'
Response:
{
"logs_20130101" : {
"aliases" : {
"2013_01" : { }
}
}
}
There is also a HEAD variant of the get indices aliases API to check if index aliases exist. The indices aliases exists API supports the same options as the get indices aliases API. Examples:
curl -XHEAD -i 'localhost:9200/_alias/2013'
curl -XHEAD -i 'localhost:9200/_alias/2013_01*'
curl -XHEAD -i 'localhost:9200/users/_alias/*'
64. Update Indices Settings
Change specific index level settings in real time.
The REST endpoint is /_settings (to update all indices) or
{index}/_settings to update one (or more) indices settings. The body
of the request includes the updated settings, for example:
{
"index" : {
"number_of_replicas" : 4
}
}
The above will change the number of replicas to 4 from the current number of replicas. Here is a curl example:
curl -XPUT 'localhost:9200/my_index/_settings' -d '
{
"index" : {
"number_of_replicas" : 4
}
}'
The list of per-index settings which can be updated dynamically on live indices can be found in Index Modules.
Bulk Indexing Usage
For example, the update settings API can be used to dynamically tune an index to be more performant for bulk indexing, and then move it back to a more real time indexing state. Before the bulk indexing is started, use:
curl -XPUT localhost:9200/test/_settings -d '{
"index" : {
"refresh_interval" : "-1"
} }'
(Another optimization option is to start the index without any replicas, and only later adding them, but that really depends on the use case).
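A sketch of that replica-based optimization, with an illustrative index name and replica counts:

```shell
# Create the index with no replicas while bulk indexing
curl -XPUT 'localhost:9200/test' -d '{
    "settings" : { "number_of_replicas" : 0 }
}'

# ... run the bulk indexing ...

# Add replicas once the bulk load is finished
curl -XPUT 'localhost:9200/test/_settings' -d '{
    "index" : { "number_of_replicas" : 1 }
}'
```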
Then, once bulk indexing is done, the settings can be updated (back to the defaults for example):
curl -XPUT localhost:9200/test/_settings -d '{
"index" : {
"refresh_interval" : "1s"
} }'
And, a force merge should be called:
curl -XPOST 'http://localhost:9200/test/_forcemerge?max_num_segments=5'
Updating Index Analysis
It is also possible to define new analyzers for the index, but the index must be closed first and reopened after the changes are made.
For example, if the content analyzer hasn't been defined on myindex yet,
you can use the following commands to add it:
curl -XPOST 'localhost:9200/myindex/_close'
curl -XPUT 'localhost:9200/myindex/_settings' -d '{
"analysis" : {
"analyzer":{
"content":{
"type":"custom",
"tokenizer":"whitespace"
}
}
}
}'
curl -XPOST 'localhost:9200/myindex/_open'
65. Get Settings
The get settings API allows you to retrieve settings of an index or indices:
$ curl -XGET 'http://localhost:9200/twitter/_settings'
Multiple Indices and Types
The get settings API can be used to get settings for more than one index
with a single call. General usage of the API follows the
following syntax: host:port/{index}/_settings where
{index} can stand for comma-separated list of index names and aliases. To
get settings for all indices you can use _all for {index}.
Wildcard expressions are also supported. The following are some examples:
curl -XGET 'http://localhost:9200/twitter,kimchy/_settings'
curl -XGET 'http://localhost:9200/_all/_settings'
curl -XGET 'http://localhost:9200/2013-*/_settings'
Filtering settings by name
The settings that are returned can be filtered with wildcard matching as follows:
curl -XGET 'http://localhost:9200/2013-*/_settings/name=index.number_*'
66. Analyze
Performs the analysis process on a text and returns the token breakdown of the text.
Can be used without specifying an index against one of the many built-in analyzers:
curl -XGET 'localhost:9200/_analyze' -d '
{
"analyzer" : "standard",
"text" : "this is a test"
}'
If the text parameter is provided as an array of strings, it is analyzed as a multi-valued field.
curl -XGET 'localhost:9200/_analyze' -d '
{
"analyzer" : "standard",
"text" : ["this is a test", "the second text"]
}'
Or by building a custom transient analyzer out of tokenizers, token filters and char filters. Token filters can use the shorter filters parameter name:
curl -XGET 'localhost:9200/_analyze' -d '
{
"tokenizer" : "keyword",
"filters" : ["lowercase"],
"text" : "this is a test"
}'
curl -XGET 'localhost:9200/_analyze' -d '
{
"tokenizer" : "keyword",
"token_filters" : ["lowercase"],
"char_filters" : ["html_strip"],
"text" : "this is a <b>test</b>"
}'
It can also run against a specific index:
curl -XGET 'localhost:9200/test/_analyze' -d '
{
"text" : "this is a test"
}'
The above will run an analysis on the "this is a test" text, using the
default index analyzer associated with the test index. A different
analyzer can also be provided:
curl -XGET 'localhost:9200/test/_analyze' -d '
{
"analyzer" : "whitespace",
"text" : "this is a test"
}'
Also, the analyzer can be derived based on a field mapping, for example:
curl -XGET 'localhost:9200/test/_analyze' -d '
{
"field" : "obj1.field1",
"text" : "this is a test"
}'
Will cause the analysis to happen based on the analyzer configured in the
mapping for obj1.field1 (and if not, the default index analyzer).
All parameters can also be supplied as request parameters. For example:
curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase&text=this+is+a+test'
For backwards compatibility, we also accept the text parameter as the body of the request,
provided it doesn’t start with { :
curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&token_filters=lowercase&char_filters=html_strip' -d 'this is a <b>test</b>'
66.1. Explain Analyze
If you want to get more advanced details, set explain to true (defaults to false). It will output all token attributes for each token.
You can filter the token attributes you want to output by setting the attributes option.
experimental[The format of the additional detail information is experimental and can change at any time]
GET test/_analyze
{
"tokenizer" : "standard",
"token_filters" : ["snowball"],
"text" : "detailed output",
"explain" : true,
"attributes" : ["keyword"]
}
Set "keyword" to output the "keyword" attribute only.
coming[2.0.0, body based parameters were added in 2.0.0]
The request returns the following result:
{
"detail" : {
"custom_analyzer" : true,
"charfilters" : [ ],
"tokenizer" : {
"name" : "standard",
"tokens" : [ {
"token" : "detailed",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
}, {
"token" : "output",
"start_offset" : 9,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 1
} ]
},
"tokenfilters" : [ {
"name" : "snowball",
"tokens" : [ {
"token" : "detail",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0,
"keyword" : false
}, {
"token" : "output",
"start_offset" : 9,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 1,
"keyword" : false
} ]
} ]
}
}
Output only the "keyword" attribute, since "attributes" was specified in the request.
67. Index Templates
Index templates allow you to define templates that will automatically be applied when new indices are created. The templates include both settings and mappings, and a simple pattern template that controls whether the template should be applied to the new index.
Templates are only applied at index creation time. Changing a template will have no impact on existing indices.
For example:
curl -XPUT localhost:9200/_template/template_1 -d '
{
"template" : "te*",
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"type1" : {
"_source" : { "enabled" : false }
}
}
}
'
Defines a template named template_1, with a template pattern of te*.
The settings and mappings will be applied to any index name that matches
the te* template.
It is also possible to include aliases in an index template as follows:
curl -XPUT localhost:9200/_template/template_1 -d '
{
"template" : "te*",
"settings" : {
"number_of_shards" : 1
},
"aliases" : {
"alias1" : {},
"alias2" : {
"filter" : {
"term" : {"user" : "kimchy" }
},
"routing" : "kimchy"
},
"{index}-alias" : {}
}
}
'
The {index} placeholder within the alias name will be replaced with the
actual index name that the template gets applied to during index creation.
Deleting a Template
Index templates are identified by a name (in the above case
template_1) and can be deleted as well:
curl -XDELETE localhost:9200/_template/template_1
Getting templates
Index templates are identified by a name (in the above case
template_1) and can be retrieved using the following:
curl -XGET localhost:9200/_template/template_1
You can also match several templates by using wildcards like:
curl -XGET localhost:9200/_template/temp*
curl -XGET localhost:9200/_template/template_1,template_2
To get list of all index templates you can run:
curl -XGET localhost:9200/_template/
Template exists
Used to check if the template exists or not. For example:
curl -XHEAD -i localhost:9200/_template/template_1
The HTTP status code indicates if the template with the given name
exists or not. A status code 200 means it exists, a 404 it does not.
Multiple Template Matching
Multiple index templates can potentially match an index, in this case,
both the settings and mappings are merged into the final configuration
of the index. The order of the merging can be controlled using the
order parameter, with lower order being applied first, and higher
orders overriding them. For example:
curl -XPUT localhost:9200/_template/template_1 -d '
{
"template" : "*",
"order" : 0,
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"type1" : {
"_source" : { "enabled" : false }
}
}
}
'
curl -XPUT localhost:9200/_template/template_2 -d '
{
"template" : "te*",
"order" : 1,
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"type1" : {
"_source" : { "enabled" : true }
}
}
}
'
The above will disable storing the _source on all type1 types, but
for indices that start with te*, _source will still be enabled.
Note, for mappings, the merging is "deep", meaning that specific
object/property based mappings can easily be added/overridden on higher
order templates, with lower order templates providing the basis.
68. Warmers
deprecated[2.3.0,Thanks to disk-based norms and doc values, warmers don’t have use-cases anymore]
Index warming allows you to run registered search requests to warm up the index before it is available for search. With the near real time aspect of search, cold data (segments) will be warmed up before they become available for search. This includes things such as the filter cache, filesystem cache, and loading field data for fields.
Warmup searches typically include requests that require heavy loading of data, such as aggregations or sorting on specific fields. The warmup APIs allow you to register warmup (search) requests under specific names, remove them, and get them.
Index warmup can be disabled by setting index.warmer.enabled to
false. It is supported as a realtime setting using the update settings
API. This can be handy when doing initial bulk indexing: disable pre-registered
warmers to make indexing faster and less expensive, and then
enable warmers again once bulk indexing is done.
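A minimal sketch of toggling this setting with the update settings API (the index name test is illustrative):

```shell
# Disable pre-registered warmers before bulk indexing
curl -XPUT 'localhost:9200/test/_settings' -d '{
    "index.warmer.enabled" : false
}'

# Re-enable them once bulk indexing is done
curl -XPUT 'localhost:9200/test/_settings' -d '{
    "index.warmer.enabled" : true
}'
```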
Index Creation / Templates
Warmers can be registered when an index gets created, for example:
curl -XPUT localhost:9200/test -d '{
"warmers" : {
"warmer_1" : {
"types" : [],
"source" : {
"query" : {
...
},
"aggs" : {
...
}
}
}
}
}'
Or, in an index template:
curl -XPUT localhost:9200/_template/template_1 -d '
{
"template" : "te*",
"warmers" : {
"warmer_1" : {
"types" : [],
"source" : {
"query" : {
...
},
"aggs" : {
...
}
}
}
}
}'
On the same level as types and source, the request_cache flag is supported
to enable request caching for the warmed search request. If not specified, it will
use the index level configuration of query caching.
Put Warmer
Allows you to put a warmup search request on a specific index (or indices), with the body consisting of a regular search request. Types can be provided as part of the URI if the search request is designed to be run only against specific types.
Here is an example that registers a warmup called warmer_1 against
index test (can be alias or several indices), for a search request
that runs against all types:
curl -XPUT localhost:9200/test/_warmer/warmer_1 -d '{
"query" : {
"match_all" : {}
},
"aggs" : {
"aggs_1" : {
"terms" : {
"field" : "field"
}
}
}
}'
And an example that registers a warmup against specific types:
curl -XPUT localhost:9200/test/type1/_warmer/warmer_1 -d '{
"query" : {
"match_all" : {}
},
"aggs" : {
"aggs_1" : {
"terms" : {
"field" : "field"
}
}
}
}'
All options:
PUT _warmer/{warmer_name}
PUT /{index}/_warmer/{warmer_name}
PUT /{index}/{type}/_warmer/{warmer_name}
where

{index}
* | _all | glob pattern | name1, name2, …

{type}
* | _all | glob pattern | name1, name2, …
Instead of _warmer you can also use the plural _warmers.
The request_cache parameter can be used to enable request caching for
the search request. If not specified, it will use the index level configuration
of query caching.
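A sketch, assuming request_cache is passed as a query-string parameter on the put warmer request:

```shell
# Register a warmer and explicitly enable request caching for the
# warmed search request (query-string placement is an assumption).
curl -XPUT 'localhost:9200/test/_warmer/warmer_1?request_cache=true' -d '{
    "query" : {
        "match_all" : {}
    }
}'
```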
Delete Warmers
Warmers can be deleted using the following endpoint:
[DELETE] /{index}/_warmer/{name}
where

{index}
* | _all | glob pattern | name1, name2, …

{name}
* | _all | glob pattern | name1, name2, …
Instead of _warmer you can also use the plural _warmers.
GETting Warmer
Getting a warmer for specific index (or alias, or several indices) based on its name. The provided name can be a simple wildcard expression or omitted to get all warmers.
Some examples:
# get warmer named warmer_1 on test index
curl -XGET localhost:9200/test/_warmer/warmer_1
# get all warmers that start with warm on test index
curl -XGET localhost:9200/test/_warmer/warm*
# get all warmers for test index
curl -XGET localhost:9200/test/_warmer/
69. Shadow replica indices
experimental[]
If you would like to use a shared filesystem, you can use the shadow replicas settings to choose where on disk the data for an index should be kept, as well as how Elasticsearch should replay operations on all the replica shards of an index.
In order to fully utilize the index.data_path and index.shadow_replicas
settings, you need to allow Elasticsearch to use the same data directory for
multiple instances by setting node.add_id_to_custom_path to false in
elasticsearch.yml:
node.add_id_to_custom_path: false
You will also need to indicate to the security manager where the custom indices
will be, so that the correct permissions can be applied. You can do this by
setting the path.shared_data setting in elasticsearch.yml:
path.shared_data: /opt/data
This means that Elasticsearch can read and write to files in any subdirectory of
the path.shared_data setting.
You can then create an index with a custom data path, where each node will use this path for the data:
Because shadow replicas do not index the document on replica shards, it’s possible for the replica’s known mapping to be behind the index’s known mapping if the latest cluster state has not yet been processed on the node containing the replica. Because of this, it is highly recommended to use pre-defined mappings when using shadow replicas.
curl -XPUT 'localhost:9200/my_index' -d '
{
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 4,
"data_path": "/opt/data/my_index",
"shadow_replicas": true
}
}'
In the above example, the "/opt/data/my_index" path is a shared filesystem that
must be available on every node in the Elasticsearch cluster. You must also
ensure that the Elasticsearch process has the correct permissions to read from
and write to the directory used in the index.data_path setting.
The data_path does not have to contain the index name, in this case,
"my_index" was used but it could easily also have been "/opt/data/"
An index that has been created with the index.shadow_replicas setting set to
"true" will not replicate document operations to any of the replica shards,
instead, it will only continually refresh. Once segments are available on the
filesystem where the shadow replica resides (after an Elasticsearch "flush"), a
regular refresh (governed by the index.refresh_interval) can be used to make
the new data searchable.
Since documents are only indexed on the primary shard, realtime GET
requests could fail to return a document if executed on the replica shard,
therefore, GET API requests automatically have the ?preference=_primary flag
set if there is no preference flag already set.
In order to ensure the data is being synchronized in a fast enough manner, you may need to tune the flush threshold for the index to a desired number. A flush is needed to fsync segment files to disk, so they will be visible to all other replica nodes. Users should test what flush threshold levels they are comfortable with, as increased flushing can impact indexing performance.
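As a sketch, the translog flush threshold could be lowered via the update settings API (the setting name index.translog.flush_threshold_size and the value shown are assumptions to illustrate the idea):

```shell
# Flush (fsync segments) more frequently so shadow replicas see new
# segments sooner; the 256mb value is purely illustrative.
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
    "index" : {
        "translog.flush_threshold_size" : "256mb"
    }
}'
```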
The Elasticsearch cluster will still detect the loss of a primary shard, and
transform the replica into a primary in this situation. This transformation will
take slightly longer, since no IndexWriter is maintained for each shadow
replica.
Below is the list of settings that can be changed using the update settings API:

index.data_path (string)
Path to use for the index’s data. Note that by default Elasticsearch will append the node ordinal to the path to ensure multiple instances of Elasticsearch on the same machine do not share a data directory.

index.shadow_replicas
Boolean value indicating this index should use shadow replicas. Defaults to false.

index.shared_filesystem
Boolean value indicating this index uses a shared filesystem. Defaults to true if index.shadow_replicas is set to true, false otherwise.

index.shared_filesystem.recover_on_any_node
Boolean value indicating whether the primary shards for the index should be allowed to recover on any node in the cluster, regardless of the number of replicas or whether the node has previously had the shard allocated to it before. Defaults to false.
69.1. Node level settings related to shadow replicas
These are non-dynamic settings that need to be configured in elasticsearch.yml:

node.add_id_to_custom_path
Boolean setting indicating whether Elasticsearch should append the node’s ordinal to the custom data path. For example, if this is enabled and a path of "/tmp/foo" is used, the first locally-running node will use "/tmp/foo/0", the second will use "/tmp/foo/1", the third "/tmp/foo/2", etc. Defaults to true.
70. Indices Stats
Indices level stats provide statistics on different operations happening on an index. The API provides statistics on the index level scope (though most stats can also be retrieved using node level scope).
The following returns high level aggregation and index level stats for all indices:
curl localhost:9200/_stats
Specific index stats can be retrieved using:
curl localhost:9200/index1,index2/_stats
By default, all stats are returned; you can also specify in the URI which specific stats to return. Those stats can be any of:
docs
The number of docs / deleted docs (docs not yet merged out). Note, affected by refreshing the index.

store
The size of the index.

indexing
Indexing statistics, can be combined with a comma separated list of types to provide document type level stats.

get
Get statistics, including missing stats.

search
Search statistics. You can include statistics for custom groups by adding an extra groups parameter (a comma separated list of group names).

completion
Completion suggest statistics.

fielddata
Fielddata statistics.

flush
Flush statistics.

merge
Merge statistics.

request_cache
Shard request cache statistics.

refresh
Refresh statistics.

suggest
Suggest statistics.

warmer
Warmer statistics.

translog
Translog statistics.
Some statistics allow per-field granularity, accepting a comma-separated list of included fields. By default all fields are included:

fields
List of fields to be included in the statistics. This is used as the default list unless a more specific field list is provided (see below).

completion_fields
List of fields to be included in the Completion Suggest statistics.

fielddata_fields
List of fields to be included in the Fielddata statistics.
Here are some samples:
# Get back stats for merge and refresh only for all indices
curl 'localhost:9200/_stats/merge,refresh'
# Get back stats for type1 and type2 documents for the my_index index
curl 'localhost:9200/my_index/_stats/indexing?types=type1,type2'
# Get back just search stats for group1 and group2
curl 'localhost:9200/_stats/search?groups=group1,group2'
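The per-field options can be combined with specific stats as well; for instance (field names are illustrative):

```shell
# Fielddata stats restricted to two specific fields
curl 'localhost:9200/_stats/fielddata?fielddata_fields=field1,field2'
# Completion suggest stats for a specific field
curl 'localhost:9200/_stats/completion?completion_fields=suggest'
```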
The stats returned are aggregated on the index level, with
primaries and total aggregations, where primaries are the values for only the
primary shards, and total are the cumulated values for both primary and replica shards.
In order to get back shard level stats, set the level parameter to shards.
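For example:

```shell
# Return stats broken down to the shard level for all indices
curl 'localhost:9200/_stats?level=shards'
```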
Note, as shards move around the cluster, their stats will be cleared as they are created on other nodes. On the other hand, even though a shard "left" a node, that node will still retain the stats that shard contributed to.
71. Indices Segments
Provides low-level information about the Lucene segments that each index shard is built from. It can be used to get more insight into the state of a shard or an index, for example to spot optimization opportunities or data "wasted" by deletes.
Endpoints include segments for a specific index, several indices, or all:
curl -XGET 'http://localhost:9200/test/_segments'
curl -XGET 'http://localhost:9200/test1,test2/_segments'
curl -XGET 'http://localhost:9200/_segments'
Response:
{
...
"_3": {
"generation": 3,
"num_docs": 1121,
"deleted_docs": 53,
"size_in_bytes": 228288,
"memory_in_bytes": 3211,
"committed": true,
"search": true,
"version": "4.6",
"compound": true
}
...
}
- _3
-
The key of the JSON document is the name of the segment. This name is used to generate file names: all files starting with this segment name in the directory of the shard belong to this segment.
- generation
-
A generation number that is incremented every time a new segment is written. The segment name is derived from this generation number.
- num_docs
-
The number of non-deleted documents that are stored in this segment.
- deleted_docs
-
The number of deleted documents that are stored in this segment. It is perfectly fine if this number is greater than 0; space will be reclaimed when this segment gets merged.
- size_in_bytes
-
The amount of disk space that this segment uses, in bytes.
- memory_in_bytes
-
Segments need to store some data into memory in order to be searchable efficiently. This number returns the number of bytes that are used for that purpose. A value of -1 indicates that Elasticsearch was not able to compute this number.
- committed
-
Whether the segment has been synced to disk. Segments that are committed would survive a hard reboot. There is no need to worry if this is false: the data from uncommitted segments is also stored in the transaction log, so Elasticsearch is able to replay the changes on the next start.
- search
-
Whether the segment is searchable. A value of false would most likely mean that the segment has been written to disk but no refresh occurred since then to make it searchable.
- version
-
The version of Lucene that has been used to write this segment.
- compound
-
Whether the segment is stored in a compound file. When true, this means that Lucene merged all files of the segment into a single one in order to save file descriptors.
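The num_docs and deleted_docs fields together give a quick read on how much of a segment a merge could reclaim. A small shell check, using the numbers from the example segment _3 above:

```shell
# Fraction of documents in segment _3 that are marked deleted and thus
# reclaimable by a merge: deleted_docs / (num_docs + deleted_docs).
awk 'BEGIN { printf "%.1f%% of docs in _3 are deleted\n", 53 / (1121 + 53) * 100 }'
```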
Verbose mode
To add additional information that can be used for debugging, use the verbose flag.
Note: The format of the additional verbose information is experimental and can change at any time.
curl -XGET 'http://localhost:9200/test/_segments?verbose=true'
Response:
{
...
"_3": {
...
"ram_tree": [
{
"description": "postings [PerFieldPostings(format=1)]",
"size_in_bytes": 2696,
"children": [
{
"description": "format 'Lucene50_0' ...",
"size_in_bytes": 2608,
"children" :[ ... ]
},
...
]
},
...
]
}
...
}
72. Indices Recovery
The indices recovery API provides insight into on-going index shard recoveries. Recovery status may be reported for specific indices, or cluster-wide.
For example, the following command would show recovery information for the indices "index1" and "index2".
curl -XGET http://localhost:9200/index1,index2/_recovery
To see cluster-wide recovery status simply leave out the index names.
curl -XGET 'http://localhost:9200/_recovery?pretty&human'
Response:
{
"index1" : {
"shards" : [ {
"id" : 0,
"type" : "SNAPSHOT",
"stage" : "INDEX",
"primary" : true,
"start_time" : "2014-02-24T12:15:59.716",
"start_time_in_millis": 1393244159716,
"total_time" : "2.9m",
"total_time_in_millis" : 175576,
"source" : {
"repository" : "my_repository",
"snapshot" : "my_snapshot",
"index" : "index1"
},
"target" : {
"id" : "ryqJ5lO5S4-lSFbGntkEkg",
"hostname" : "my.fqdn",
"ip" : "10.0.1.7",
"name" : "my_es_node"
},
"index" : {
"size" : {
"total" : "75.4mb",
"total_in_bytes" : 79063092,
"reused" : "0b",
"reused_in_bytes" : 0,
"recovered" : "65.7mb",
"recovered_in_bytes" : 68891939,
"percent" : "87.1%"
},
"files" : {
"total" : 73,
"reused" : 0,
"recovered" : 69,
"percent" : "94.5%"
},
"total_time" : "0s",
"total_time_in_millis" : 0
},
"translog" : {
"recovered" : 0,
"total" : 0,
"percent" : "100.0%",
"total_on_start" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
},
"start" : {
"check_index_time" : "0s",
"check_index_time_in_millis" : 0,
"total_time" : "0s",
"total_time_in_millis" : 0
}
} ]
}
}
The above response shows a single index recovering a single shard. In this case, the source of the recovery is a snapshot repository and the target of the recovery is the node with name "my_es_node".
Additionally, the output shows the number and percent of files recovered, as well as the number and percent of bytes recovered.
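Those percentages are simply recovered over total. Recomputing them from the byte and file counts in the response above:

```shell
# percent fields in the recovery output are recovered / total:
#   bytes: recovered_in_bytes / total_in_bytes
#   files: files.recovered / files.total
awk 'BEGIN {
  printf "bytes: %.1f%%\n", 68891939 / 79063092 * 100
  printf "files: %.1f%%\n", 69 / 73 * 100
}'
```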
In some cases a higher level of detail may be preferable. Setting "detailed=true" will present a list of physical files in recovery.
curl -XGET 'http://localhost:9200/_recovery?pretty&human&detailed=true'
Response:
{
"index1" : {
"shards" : [ {
"id" : 0,
"type" : "STORE",
"stage" : "DONE",
"primary" : true,
"start_time" : "2014-02-24T12:38:06.349",
"start_time_in_millis" : 1393245486349,
"stop_time" : "2014-02-24T12:38:08.464",
"stop_time_in_millis" : 1393245488464,
"total_time" : "2.1s",
"total_time_in_millis" : 2115,
"source" : {
"id" : "RGMdRc-yQWWKIBM4DGvwqQ",
"hostname" : "my.fqdn",
"ip" : "10.0.1.7",
"name" : "my_es_node"
},
"target" : {
"id" : "RGMdRc-yQWWKIBM4DGvwqQ",
"hostname" : "my.fqdn",
"ip" : "10.0.1.7",
"name" : "my_es_node"
},
"index" : {
"size" : {
"total" : "24.7mb",
"total_in_bytes" : 26001617,
"reused" : "24.7mb",
"reused_in_bytes" : 26001617,
"recovered" : "0b",
"recovered_in_bytes" : 0,
"percent" : "100.0%"
},
"files" : {
"total" : 26,
"reused" : 26,
"recovered" : 0,
"percent" : "100.0%",
"details" : [ {
"name" : "segments.gen",
"length" : 20,
"recovered" : 20
}, {
"name" : "_0.cfs",
"length" : 135306,
"recovered" : 135306
}, {
"name" : "segments_2",
"length" : 251,
"recovered" : 251
},
...
]
},
"total_time" : "2ms",
"total_time_in_millis" : 2
},
"translog" : {
"recovered" : 71,
"total_time" : "2.0s",
"total_time_in_millis" : 2025
},
"start" : {
"check_index_time" : 0,
"total_time" : "88ms",
"total_time_in_millis" : 88
}
} ]
}
}
This response shows a detailed listing (truncated for brevity) of the actual files recovered and their sizes.
Also shown are the timings in milliseconds of the various stages of recovery: index retrieval, translog replay, and index start time.
Note that the above listing indicates that the recovery is in stage "done". All recoveries, whether on-going or complete, are kept in cluster state and may be reported on at any time. Setting "active_only=true" will cause only on-going recoveries to be reported.
Here is a complete list of options:
detailed: Display a detailed view. This is primarily useful for viewing the recovery of physical index files. Default: false.
active_only: Display only those recoveries that are currently on-going. Default: false.
Description of output fields:
id: Shard ID
type: Recovery type, e.g. STORE or SNAPSHOT as seen in the example responses above
stage: Recovery stage, e.g. INDEX or DONE as seen in the example responses above
primary: true if the shard is a primary, false otherwise
start_time: Timestamp of recovery start
stop_time: Timestamp of recovery finish
total_time_in_millis: Total time to recover a shard, in milliseconds
source: Recovery source; for a snapshot recovery this includes the repository and snapshot name, otherwise it describes the source node
target: Destination node
index: Statistics about physical index recovery
translog: Statistics about translog recovery
start: Statistics about the time to open and start the index
73. Indices Shard Stores
Provides store information for shard copies of indices. Store information reports which nodes shard copies exist on, the shard copy version (indicating how recent they are), and any exceptions encountered while opening the shard index or from an earlier engine failure.
By default, it only lists store information for shards that have at least one unallocated copy. When the cluster health status is yellow, this will list store information for shards that have at least one unassigned replica; when the cluster health status is red, for shards that have unassigned primaries.
Endpoints include shard stores information for a specific index, several indices, or all:
curl -XGET 'http://localhost:9200/test/_shard_stores'
curl -XGET 'http://localhost:9200/test1,test2/_shard_stores'
curl -XGET 'http://localhost:9200/_shard_stores'
The scope of shards for which store information is listed can be changed through the
status param. It defaults to yellow and red: yellow lists store information for
shards with at least one unassigned replica, and red for shards with an unassigned
primary shard.
Use green to list store information for shards with all copies assigned.
curl -XGET 'http://localhost:9200/_shard_stores?status=green'
Response:
The shard stores information is grouped by indices and shard ids.
{
...
"0": {
"stores": [
{
"sPa3OgxLSYGvQ4oPs-Tajw": {
"name": "node_t0",
"transport_address": "local[1]",
"attributes": {
"mode": "local"
}
},
"version": 4,
"allocation" : "primary" | "replica" | "unused",
"store_exception": ...
},
...
]
},
...
}
- The key ("0" above) is the id of the shard the store information belongs to.
- stores is a list of store information for all copies of the shard.
- Each copy is keyed by the unique id of the node that hosts it, with the node's name, transport address and attributes.
- version is the version of the store copy.
- allocation is the status of the store copy: whether it is used as a primary, a replica, or not used at all.
- store_exception is any exception encountered while opening the shard index or from an earlier engine failure.
74. Clear Cache
The clear cache API allows you to clear either all caches or specific caches associated with one or more indices.
$ curl -XPOST 'http://localhost:9200/twitter/_cache/clear'
The API, by default, will clear all caches. Specific caches can be cleared
explicitly by setting query, fielddata or request.
All caches relating to specific fields can also be cleared by
specifying the fields parameter with a comma delimited list of the
relevant fields.
Multi Index
The clear cache API can be applied to more than one index with a single
call, or even on _all the indices.
$ curl -XPOST 'http://localhost:9200/kimchy,elasticsearch/_cache/clear'
$ curl -XPOST 'http://localhost:9200/_cache/clear'
75. Flush
The flush API allows you to flush one or more indices. The flush process of an index frees memory by flushing data to the index storage and clearing the internal transaction log. By default, Elasticsearch uses memory heuristics to automatically trigger flush operations as required in order to clear memory.
POST /twitter/_flush
Request Parameters
The flush API accepts the following request parameters:
wait_if_ongoing: If set to true, the flush operation will block until it can be executed if another flush operation is already running. The default is false, which causes an exception to be thrown on the shard level if another flush operation is already running.
force: Whether a flush should be forced even if it is not necessarily needed, i.e. if no changes will be committed to the index. This is useful if transaction log IDs should be incremented even if no uncommitted changes are present. (This setting can be considered internal.)
Multi Index
The flush API can be applied to more than one index with a single call,
or even on _all the indices.
POST /kimchy,elasticsearch/_flush
POST /_flush
75.1. Synced Flush
Elasticsearch tracks the indexing activity of each shard. Shards that have not
received any indexing operations for 5 minutes are automatically marked as inactive. This presents
an opportunity for Elasticsearch to reduce shard resources and also perform
a special kind of flush, called synced flush. A synced flush performs a normal flush, then adds
a generated unique marker (sync_id) to all shards.
Since the sync_id marker was added when there were no ongoing indexing operations, it can be used as a quick way to check whether two copies of a shard have identical Lucene indices. This quick sync_id comparison (if present) is used during recovery or restarts to skip the first and most costly phase of the process. In that case, no segment files need to be copied and the transaction log replay phase of the recovery can start immediately. Note that since the sync_id marker was applied together with a flush, it is very likely that the transaction log will be empty, speeding up recoveries even more.
This is particularly useful for use cases having lots of indices which are never or very rarely updated, such as time based data. This use case typically generates lots of indices whose recovery without the synced flush marker would take a long time.
To check whether a shard has a marker or not, look for the commit section of shard stats returned by
the indices stats API:
GET /twitter/_stats/commit?level=shards
which returns something similar to:
{
...
"indices": {
"twitter": {
"primaries": {},
"total": {},
"shards": {
"0": [
{
"routing": {
...
},
"commit": {
"id": "te7zF7C4UsirqvL6jp/vUg==",
"generation": 2,
"user_data": {
"sync_id": "AU2VU0meX-VX2aNbEUsD" <1>,
...
},
"num_docs": 0
}
}
...
],
...
}
}
}
}
<1> the sync id marker
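To pull just the markers out of such a response from the shell, a grep one-liner is enough. A sketch, run here against a saved, abridged copy of the response above rather than a live cluster:

```shell
# Extract sync_id markers from a (saved) shard-level stats response.
grep -o '"sync_id": "[^"]*"' <<'EOF'
{
  "commit": {
    "id": "te7zF7C4UsirqvL6jp/vUg==",
    "generation": 2,
    "user_data": {
      "sync_id": "AU2VU0meX-VX2aNbEUsD"
    },
    "num_docs": 0
  }
}
EOF
```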
Synced Flush API
The Synced Flush API allows an administrator to initiate a synced flush manually. This can be particularly useful for a planned (rolling) cluster restart where you can stop indexing and don’t want to wait the default 5 minutes for idle indices to be sync-flushed automatically.
While handy, there are a couple of caveats for this API:
-
Synced flush is a best effort operation. Any ongoing indexing operations will cause the synced flush to fail on that shard. This means that some shards may be synced flushed while others aren’t. See below for more.
-
The sync_id marker is removed as soon as the shard is flushed again. That is because a flush replaces the low-level Lucene commit point where the marker is stored. Uncommitted operations in the transaction log do not remove the marker. In practice, any indexing operation on an index should be considered as removing the marker, as a flush can be triggered by Elasticsearch at any time.
Note: It is harmless to request a synced flush while there is ongoing indexing. Shards that are idle will succeed and shards that are not will fail. Any shards that succeeded will have faster recovery times.
POST /twitter/_flush/synced
The response contains details about how many shards were successfully sync-flushed and information about any failure.
Here is what it looks like when all shards of a two shards and one replica index successfully sync-flushed:
{
"_shards": {
"total": 4,
"successful": 4,
"failed": 0
},
"twitter": {
"total": 4,
"successful": 4,
"failed": 0
}
}
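The total of 4 follows from the index layout: every shard copy counts, so an index with 2 primary shards and 1 replica has number_of_shards * (number_of_replicas + 1) copies. As a quick check:

```shell
# Total shard copies reported by synced flush for the example index:
# number_of_shards * (number_of_replicas + 1) = 2 * (1 + 1)
awk 'BEGIN { print 2 * (1 + 1) }'
```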
Here is what it looks like when one shard group failed due to pending operations:
{
"_shards": {
"total": 4,
"successful": 2,
"failed": 2
},
"twitter": {
"total": 4,
"successful": 2,
"failed": 2,
"failures": [
{
"shard": 1,
"reason": "[2] ongoing operations on primary"
}
]
}
}
Note: The above error is shown when the synced flush fails due to concurrent indexing operations. The HTTP status code in that case will be 409 CONFLICT.
Sometimes the failures are specific to a shard copy. The copies that failed will not be eligible for fast recovery but those that succeeded still will be. This case is reported as follows:
{
"_shards": {
"total": 4,
"successful": 1,
"failed": 1
},
"twitter": {
"total": 4,
"successful": 3,
"failed": 1,
"failures": [
{
"shard": 1,
"reason": "unexpected error",
"routing": {
"state": "STARTED",
"primary": false,
"node": "SZNr2J_ORxKTLUCydGX4zA",
"relocating_node": null,
"shard": 1,
"index": "twitter"
}
}
]
}
}
Note: When a shard copy fails to sync-flush, the HTTP status code returned will be 409 CONFLICT.
The synced flush API can be applied to more than one index with a single call,
or even on _all the indices.
POST /kimchy,elasticsearch/_flush/synced
POST /_flush/synced
76. Refresh
The refresh API allows you to explicitly refresh one or more indices, making all operations performed since the last refresh available for search. The (near) real-time capabilities depend on the index engine used. For example, the internal one requires refresh to be called, but by default a refresh is scheduled periodically.
$ curl -XPOST 'http://localhost:9200/twitter/_refresh'
Multi Index
The refresh API can be applied to more than one index with a single
call, or even on _all the indices.
$ curl -XPOST 'http://localhost:9200/kimchy,elasticsearch/_refresh'
$ curl -XPOST 'http://localhost:9200/_refresh'
77. Force Merge
The force merge API allows you to force merge one or more indices. Merging relates to the number of segments a Lucene index holds within each shard; the force merge operation reduces the number of segments by merging them.
This call will block until the merge is complete. If the http connection is lost, the request will continue in the background, and any new requests will block until the previous force merge is complete.
$ curl -XPOST 'http://localhost:9200/twitter/_forcemerge'
Request Parameters
The force merge API accepts the following request parameters:
max_num_segments: The number of segments to merge to. To fully merge the index, set it to 1.
only_expunge_deletes: Should the merge process only expunge segments with deletes in them. In Lucene, a document is not deleted from a segment, just marked as deleted. During a merge of segments, a new segment is created that does not have those deletes. This flag allows merging only the segments that have deletes. Defaults to false.
flush: Should a flush be performed after the forced merge. Defaults to true.
Multi Index
The force merge API can be applied to more than one index with a single call, or
even on _all the indices.
$ curl -XPOST 'http://localhost:9200/kimchy,elasticsearch/_forcemerge'
$ curl -XPOST 'http://localhost:9200/_forcemerge'
78. Optimize
Deprecated in 2.1.0: the Optimize API has been renamed to the force merge API.
The optimize API allows you to optimize one or more indices. The optimize process optimizes the index for faster search operations (and relates to the number of segments a Lucene index holds within each shard). The optimize operation reduces the number of segments by merging them.
This call will block until the optimize is complete. If the http connection is lost, the request will continue in the background, and any new requests will block until the previous optimize is complete.
$ curl -XPOST 'http://localhost:9200/twitter/_optimize'
Request Parameters
The optimize API accepts the following request parameters as query arguments:
max_num_segments: The number of segments to optimize to. To fully optimize the index, set it to 1.
only_expunge_deletes: Should the optimize process only expunge segments with deletes in them. In Lucene, a document is not deleted from a segment, just marked as deleted. During a merge of segments, a new segment is created that does not have those deletes. This flag allows merging only the segments that have deletes. Defaults to false.
flush: Should a flush be performed after the optimize. Defaults to true.
Multi Index
The optimize API can be applied to more than one index with a single
call, or even on _all the indices.
$ curl -XPOST 'http://localhost:9200/kimchy,elasticsearch/_optimize'
$ curl -XPOST 'http://localhost:9200/_optimize?only_expunge_deletes=true'
79. Upgrade
The upgrade API allows you to upgrade one or more indices to the latest Lucene format. The upgrade process converts any segments written with older formats.
Note: The upgrade API in its current form will not help you to migrate indices created in Elasticsearch 1.x to 5.x. The upgrade API rewrites an index in the latest Lucene format, but it still retains the original data structures that were used when the index was first created.
Migrating 1.x indices to 5.x: the only way to prepare an index created in 1.x for use in 5.x is to reindex your data in a cluster running Elasticsearch 2.3.x, which you can do with the new reindex API.
In the future, we plan to change the upgrade API to perform a reindex-in-place, i.e. to reindex the data from the old index into a new index behind the scenes.
Start an upgrade
$ curl -XPOST 'http://localhost:9200/twitter/_upgrade'
Note: Upgrading is an I/O intensive operation, and is limited to processing a single shard per node at a time. It is also not allowed to run at the same time as an optimize/force-merge.
This call will block until the upgrade is complete. If the http connection is lost, the request will continue in the background, and any new requests will block until the previous upgrade is complete.
Request Parameters
The upgrade API accepts the following request parameters:
only_ancient_segments: If true, only very old segments (from a previous Lucene major release) will be upgraded. While this will do the minimal work to ensure the next major release of Elasticsearch can read the segments, it's dangerous because it can leave other very old segments in sub-optimal formats. Defaults to false.
Check upgrade status
Use a GET request to monitor how much of an index is upgraded. This
can also be used prior to starting an upgrade to identify which
indices you want to upgrade at the same time.
The ancient byte values that are returned indicate total bytes of
segments whose version is extremely old (Lucene major version is
different from the current version), showing how much upgrading is
necessary when you run with only_ancient_segments=true.
curl 'http://localhost:9200/twitter/_upgrade?pretty&human'
{
"size": "21gb",
"size_in_bytes": 21000000000,
"size_to_upgrade": "10gb",
"size_to_upgrade_in_bytes": 10000000000,
"size_to_upgrade_ancient": "1gb",
"size_to_upgrade_ancient_in_bytes": 1000000000,
"indices": {
"twitter": {
"size": "21gb",
"size_in_bytes": 21000000000,
"size_to_upgrade": "10gb",
"size_to_upgrade_in_bytes": 10000000000,
"size_to_upgrade_ancient": "1gb",
"size_to_upgrade_ancient_in_bytes": 1000000000
}
}
}
The level of details in the upgrade status command can be controlled by
setting level parameter to cluster, index (default) or shard levels.
For example, you can run the upgrade status command with level=shard to
get detailed upgrade information of each individual shard.
cat APIs
Introduction
JSON is great… for computers. Even if it’s pretty-printed, trying to find relationships in the data is tedious. Human eyes, especially when looking at an ssh terminal, need compact and aligned text. The cat API aims to meet this need.
All the cat commands accept a query string parameter help to see all
the headers and info they provide, and the /_cat command alone lists all
the available commands.
Common parameters
Verbose
Each of the commands accepts a query string parameter v to turn on
verbose output.
% curl 'localhost:9200/_cat/master?v'
id ip node
EGtKWZlWQYWDmX29fUnp3Q 127.0.0.1 Grey, Sara
Help
Each of the commands accepts a query string parameter help which will
output its available columns.
% curl 'localhost:9200/_cat/master?help'
id | node id
ip | node transport ip address
node | node name
Headers
Each of the commands accepts a query string parameter h which forces
only those columns to appear.
% curl 'n1:9200/_cat/nodes?h=ip,port,heapPercent,name'
192.168.56.40 9300 40.3 Captain Universe
192.168.56.20 9300 15.3 Kaluu
192.168.56.50 9300 17.0 Yellowjacket
192.168.56.10 9300 12.3 Remy LeBeau
192.168.56.30 9300 43.9 Ramsey, Doug
You can also request multiple columns using simple wildcards like
/_cat/thread_pool?h=ip,bulk.* to get all headers (or aliases) starting
with bulk..
Numeric formats
Many commands provide a few types of numeric output, either a byte
value or a time value. By default, these types are human-formatted,
for example, 3.5mb instead of 3763212. The human values are not
sortable numerically, so in order to operate on these values where
order is important, you can change it.
Say you want to find the largest index in your cluster (storage used
by all the shards, not number of documents). The /_cat/indices API
is ideal. We only need to tweak two things. First, we want to turn
off human mode. We’ll use a byte-level resolution. Then we’ll pipe
our output into sort using the appropriate column, which in this
case is the eighth one.
% curl '192.168.56.10:9200/_cat/indices?bytes=b' | sort -rnk8
green wiki2 3 0 10000 0 105274918 105274918
green wiki1 3 0 10000 413 103776272 103776272
green foo 1 0 227 0 2065131 2065131
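The same pipeline can be exercised against a captured copy of that output, without a live cluster (sort -rnk8 sorts numerically, descending, on the 8th column, the total store size in bytes):

```shell
# Sort captured _cat/indices output (bytes=b) by total store size,
# largest index first.
sort -rnk8 <<'EOF' | head -n1
green wiki1 3 0 10000 413 103776272 103776272
green foo 1 0 227 0 2065131 2065131
green wiki2 3 0 10000 0 105274918 105274918
EOF
```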
80. cat aliases
aliases shows information about currently configured aliases to indices,
including filter and routing information.
% curl '192.168.56.10:9200/_cat/aliases?v'
alias index filter routing.index routing.search
alias2 test1 * - -
alias4 test1 - 2 1,2
alias1 test1 - - -
alias3 test1 - 1 1
The output shows that alias2 has a configured filter, and that alias3 and alias4
have specific routing configurations.
If you only want to get information about a single alias, you can specify
the alias in the URL, for example /_cat/aliases/alias1.
81. cat allocation
allocation provides a snapshot of how many shards are allocated to each data node
and how much disk space they are using.
% curl '192.168.56.10:9200/_cat/allocation?v'
shards diskUsed diskAvail diskRatio ip node
1 5.6gb 72.2gb 7.8% 192.168.56.10 Jarella
1 5.6gb 72.2gb 7.8% 192.168.56.30 Solarr
1 5.5gb 72.3gb 7.6% 192.168.56.20 Adam II
Here we can see that each node has been allocated a single shard and that they’re all using about the same amount of space.
82. cat count
count provides quick access to the document count of the entire
cluster, or individual indices.
% curl 192.168.56.10:9200/_cat/indices
green wiki1 3 0 10000 331 168.5mb 168.5mb
green wiki2 3 0 428 0 8mb 8mb
% curl 192.168.56.10:9200/_cat/count
1384314124582 19:42:04 10428
% curl 192.168.56.10:9200/_cat/count/wiki2
1384314139815 19:42:19 428
Note: The document count indicates the number of live documents and does not include deleted documents which have not yet been cleaned up by the merge process.
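That is why the cluster-wide count above is simply the sum of the live (non-deleted) document counts of the two indices, 10000 for wiki1 plus 428 for wiki2:

```shell
# _cat/count totals live docs only: wiki1 (10000) + wiki2 (428).
awk 'BEGIN { print 10000 + 428 }'
```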
83. cat fielddata
fielddata shows how much heap memory is currently being used by fielddata
on every data node in the cluster.
% curl '192.168.56.10:9200/_cat/fielddata?v'
id host ip node total body text
c223lARiSGeezlbrcugAYQ myhost1 10.20.100.200 Jessica Jones 385.6kb 159.8kb 225.7kb
waPCbitNQaCL6xC8VxjAwg myhost2 10.20.100.201 Adversary 435.2kb 159.8kb 275.3kb
yaDkp-G3R0q1AJ-HUEvkSQ myhost3 10.20.100.202 Microchip 284.6kb 109.2kb 175.3kb
Fields can be specified either as a query parameter, or in the URL path:
% curl '192.168.56.10:9200/_cat/fielddata?v&fields=body'
id host ip node total body
c223lARiSGeezlbrcugAYQ myhost1 10.20.100.200 Jessica Jones 385.6kb 159.8kb
waPCbitNQaCL6xC8VxjAwg myhost2 10.20.100.201 Adversary 435.2kb 159.8kb
yaDkp-G3R0q1AJ-HUEvkSQ myhost3 10.20.100.202 Microchip 284.6kb 109.2kb
% curl '192.168.56.10:9200/_cat/fielddata/body,text?v'
id host ip node total body text
c223lARiSGeezlbrcugAYQ myhost1 10.20.100.200 Jessica Jones 385.6kb 159.8kb 225.7kb
waPCbitNQaCL6xC8VxjAwg myhost2 10.20.100.201 Adversary 435.2kb 159.8kb 275.3kb
yaDkp-G3R0q1AJ-HUEvkSQ myhost3 10.20.100.202 Microchip 284.6kb 109.2kb 175.3kb
The output shows the total fielddata and then the individual fielddata for the
body and text fields.
84. cat health
health is a terse, one-line representation of the same information
from /_cluster/health. It has one option ts to disable the
timestamping.
% curl localhost:9200/_cat/health
1384308967 18:16:07 foo green 3 3 3 3 0 0 0
% curl 'localhost:9200/_cat/health?v&ts=0'
cluster status nodeTotal nodeData shards pri relo init unassign tasks
foo green 3 3 3 3 0 0 0 0
A common use of this command is to verify the health is consistent across nodes:
% pssh -i -h list.of.cluster.hosts curl -s localhost:9200/_cat/health
[1] 20:20:52 [SUCCESS] es3.vm
1384309218 18:20:18 foo green 3 3 3 3 0 0 0 0
[2] 20:20:52 [SUCCESS] es1.vm
1384309218 18:20:18 foo green 3 3 3 3 0 0 0 0
[3] 20:20:52 [SUCCESS] es2.vm
1384309218 18:20:18 foo green 3 3 3 3 0 0 0 0
A less obvious use is to track recovery of a large cluster over time. With enough shards, starting a cluster, or even recovering after losing a node, can take time (depending on your network & disk). A way to track its progress is by using this command in a delayed loop:
% while true; do curl localhost:9200/_cat/health; sleep 120; done
1384309446 18:24:06 foo red 3 3 20 20 0 0 1812 0
1384309566 18:26:06 foo yellow 3 3 950 916 0 12 870 0
1384309686 18:28:06 foo yellow 3 3 1328 916 0 12 492 0
1384309806 18:30:06 foo green 3 3 1832 916 4 0 0
^C
In this scenario, we can tell that recovery took roughly four minutes.
If this were going on for hours, we would be able to watch the
UNASSIGNED shards drop precipitously. If that number remained
static, we would have an idea that there is a problem.
Why the timestamp?
You typically are using the health command when a cluster is
malfunctioning. During this period, it’s extremely important to
correlate activities across log files, alerting systems, etc.
There are two outputs. The HH:MM:SS output is simply for quick
human consumption. The epoch time retains more information, including
date, and is machine sortable if your recovery spans days.
85. cat indices
The indices command provides a cross-section of each index. This
information spans nodes.
% curl 'localhost:9200/_cat/indices/twi*?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open twitter 5 1 11434 0 64mb 32mb
green open twitter2 2 0 2030 0 5.8mb 5.8mb
We can tell quickly how many shards make up an index, the number of docs, deleted docs, primary store size, and total store size (all shards including replicas).
Primaries
By default, the index stats are shown for all of an index's
shards, including replicas. A pri flag can be supplied to view
the relevant stats in the context of only the primaries.
Examples
Which indices are yellow?
% curl localhost:9200/_cat/indices | grep ^yell
yellow open wiki 2 1 6401 1115 151.4mb 151.4mb
yellow open twitter 5 1 11434 0 32mb 32mb
What’s my largest index by disk usage not including replicas?
% curl 'localhost:9200/_cat/indices?bytes=b' | sort -rnk8
green open wiki 2 0 6401 1115 158843725 158843725
green open twitter 5 1 11434 0 67155614 33577857
green open twitter2 2 0 2030 0 6125085 6125085
How many merge operations have the shards for the wiki completed?
% curl 'localhost:9200/_cat/indices/wiki?pri&v&h=health,index,prirep,docs.count,mt'
health index docs.count mt pri.mt
green wiki 9646 16 16
How much memory is used per index?
% curl 'localhost:9200/_cat/indices?v&h=i,tm'
i tm
wiki 8.1gb
test 30.5kb
user 1.9mb
86. cat master
master doesn’t have any extra options. It simply displays the
master’s node ID, bound IP address, and node name.
% curl 'localhost:9200/_cat/master?v'
id ip node
Ntgn2DcuTjGuXlhKDUD4vA 192.168.56.30 Solarr
This information is also available via the nodes command, but this
is slightly shorter when all you want to do, for example, is verify
all nodes agree on the master:
% pssh -i -h list.of.cluster.hosts curl -s localhost:9200/_cat/master
[1] 19:16:37 [SUCCESS] es3.vm
Ntgn2DcuTjGuXlhKDUD4vA 192.168.56.30 Solarr
[2] 19:16:37 [SUCCESS] es2.vm
Ntgn2DcuTjGuXlhKDUD4vA 192.168.56.30 Solarr
[3] 19:16:37 [SUCCESS] es1.vm
Ntgn2DcuTjGuXlhKDUD4vA 192.168.56.30 Solarr
87. cat nodeattrs
The nodeattrs command shows custom node attributes.
% curl 192.168.56.10:9200/_cat/nodeattrs
node host ip attr value
Black Bolt epsilon 192.168.1.8 rack rack314
Black Bolt epsilon 192.168.1.8 azone us-east-1
The first few columns give you basic info per node.
node host ip
Black Bolt epsilon 192.168.1.8
Black Bolt epsilon 192.168.1.8
The attr and value columns can give you a picture of custom node attributes.
attr value
rack rack314
azone us-east-1
Columns
Below is an exhaustive list of the existing headers that can be
passed to nodeattrs?h= to retrieve the relevant details in ordered
columns. If no headers are specified, then those marked as Appear
by Default will appear. If any header is specified, the defaults
are not used.
Aliases can be used in place of the full header name for brevity.
Columns appear in the order that they are listed below unless a
different order is specified (e.g., h=attr,value versus h=value,attr).
When specifying headers, the headers are not placed in the output
by default. To have the headers appear in the output, use verbose
mode (v). The header name will match the supplied value (e.g.,
pid versus p). For example:
% curl '192.168.56.10:9200/_cat/nodeattrs?v&h=name,pid,attr,value'
name pid attr value
Black Bolt 28000 rack rack314
Black Bolt 28000 azone us-east-1
| Header | Alias | Appear by Default | Description | Example |
|---|---|---|---|---|
| node | name | Yes | Name of the node | Black Bolt |
| id | nodeId | No | Unique node ID | k0zy |
| pid | p | No | Process ID | 13061 |
| host | h | Yes | Host name | n1 |
| ip | i | Yes | IP address | 127.0.1.1 |
| port | po | No | Bound transport port | 9300 |
| attr | | Yes | Attribute name | rack |
| value | | Yes | Attribute value | rack123 |
88. cat nodes
The nodes command shows the cluster topology.
% curl 192.168.56.10:9200/_cat/nodes
SP4H 4727 192.168.56.30 9300 2.3.0 1.8.0_73 72.1gb 35.4 93.9mb 79 239.1mb 0.45 3.4h d m Boneyard
_uhJ 5134 192.168.56.10 9300 2.3.0 1.8.0_73 72.1gb 33.3 93.9mb 85 239.1mb 0.06 3.4h d * Athena
HfDp 4562 192.168.56.20 9300 2.3.0 1.8.0_73 72.2gb 74.5 93.9mb 83 239.1mb 0.12 3.4h d m Zarek
The first few columns tell you where your nodes live. For sanity it also tells you what version of ES and the JVM each one runs.
nodeId pid ip port version jdk
u2PZ 4234 192.168.56.30 9300 2.3.0 1.8.0_73
URzf 5443 192.168.56.10 9300 2.3.0 1.8.0_73
ActN 3806 192.168.56.20 9300 2.3.0 1.8.0_73
The next few give a picture of your heap, memory, and load.
diskAvail heapPercent heapMax ramPercent ramMax load
72.1gb 31.3 93.9mb 81 239.1mb 0.24
72.1gb 19.6 93.9mb 82 239.1mb 0.05
72.2gb 64.9 93.9mb 84 239.1mb 0.12
The last columns provide ancillary information that can often be useful when looking at the cluster as a whole, particularly large ones. How many master-eligible nodes do I have? How many client nodes? It looks like someone restarted a node recently; which one was it?
uptime data/client master name
3.5h d m Boneyard
3.5h d * Athena
3.5h d m Zarek
Columns
Below is an exhaustive list of the existing headers that can be
passed to nodes?h= to retrieve the relevant details in ordered
columns. If no headers are specified, then those marked as Appear
by Default will appear. If any header is specified, then the defaults
are not used.
Aliases can be used in place of the full header name for brevity.
Columns appear in the order that they are listed below unless a
different order is specified (e.g., h=pid,id versus h=id,pid).
When specifying headers, the headers are not placed in the output
by default. To have the headers appear in the output, use verbose
mode (v). The header name will match the supplied value (e.g.,
pid versus p). For example:
% curl '192.168.56.10:9200/_cat/nodes?v&h=id,ip,port,v,m'
id ip port version m
pLSN 192.168.56.30 9300 2.3.0 m
k0zy 192.168.56.10 9300 2.3.0 m
6Tyi 192.168.56.20 9300 2.3.0 *
% curl '192.168.56.10:9200/_cat/nodes?h=id,ip,port,v,m'
pLSN 192.168.56.30 9300 2.3.0 m
k0zy 192.168.56.10 9300 2.3.0 m
6Tyi 192.168.56.20 9300 2.3.0 *
| Header | Alias | Appear by Default | Description | Example |
|---|---|---|---|---|
| id | nodeId | No | Unique node ID | k0zy |
| pid | p | No | Process ID | 13061 |
| host | h | Yes | Host name | n1 |
| ip | i | Yes | IP address | 127.0.1.1 |
| port | po | No | Bound transport port | 9300 |
| version | v | No | Elasticsearch version | 2.3.0 |
| build | b | No | Elasticsearch build hash | 5c03844 |
| jdk | j | No | Running Java version | 1.8.0 |
| disk.avail | d | No | Available disk space | 1.8gb |
| heap.current | hc | No | Used heap | 311.2mb |
| heap.percent | hp | Yes | Used heap percentage | 7 |
| heap.max | hm | No | Maximum configured heap | 1015.6mb |
| ram.current | rc | No | Used total memory | 513.4mb |
| ram.percent | rp | Yes | Used total memory percentage | 47 |
| ram.max | rm | No | Total memory | 2.9gb |
| file_desc.current | fdc | No | Used file descriptors | 123 |
| file_desc.percent | fdp | Yes | Used file descriptors percentage | 1 |
| file_desc.max | fdm | No | Maximum number of file descriptors | 1024 |
| load | l | No | Most recent load average | 0.22 |
| uptime | u | No | Node uptime | 17.3m |
| node.role | r | Yes | Data node (d); Client node (c) | d |
| master | m | Yes | Current master (*); master eligible (m) | m |
| name | n | Yes | Node name | Venom |
| completion.size | cs | No | Size of completion | 0b |
| fielddata.memory_size | fm | No | Used fielddata cache memory | 0b |
| fielddata.evictions | fe | No | Fielddata cache evictions | 0 |
| query_cache.memory_size | qcm | No | Used query cache memory | 0b |
| query_cache.evictions | qce | No | Query cache evictions | 0 |
| request_cache.memory_size | rcm | No | Used request cache memory | 0b |
| request_cache.evictions | rce | No | Request cache evictions | 0 |
| request_cache.hit_count | rchc | No | Request cache hit count | 0 |
| request_cache.miss_count | rcmc | No | Request cache miss count | 0 |
| flush.total | ft | No | Number of flushes | 1 |
| flush.total_time | ftt | No | Time spent in flush | 1 |
| get.current | gc | No | Number of current get operations | 0 |
| get.time | gti | No | Time spent in get | 14ms |
| get.total | gto | No | Number of get operations | 2 |
| get.exists_time | geti | No | Time spent in successful gets | 14ms |
| get.exists_total | geto | No | Number of successful get operations | 2 |
| get.missing_time | gmti | No | Time spent in failed gets | 0s |
| get.missing_total | gmto | No | Number of failed get operations | 1 |
| indexing.delete_current | idc | No | Number of current deletion operations | 0 |
| indexing.delete_time | idti | No | Time spent in deletions | 2ms |
| indexing.delete_total | idto | No | Number of deletion operations | 2 |
| indexing.index_current | iic | No | Number of current indexing operations | 0 |
| indexing.index_time | iiti | No | Time spent in indexing | 134ms |
| indexing.index_total | iito | No | Number of indexing operations | 1 |
| merges.current | mc | No | Number of current merge operations | 0 |
| merges.current_docs | mcd | No | Number of current merging documents | 0 |
| merges.current_size | mcs | No | Size of current merges | 0b |
| merges.total | mt | No | Number of completed merge operations | 0 |
| merges.total_docs | mtd | No | Number of merged documents | 0 |
| merges.total_size | mts | No | Size of completed merges | 0b |
| merges.total_time | mtt | No | Time spent merging documents | 0s |
| percolate.current | pc | No | Number of current percolations | 0 |
| percolate.memory_size | pm | No | Memory used by current percolations | 0b |
| percolate.queries | pq | No | Number of registered percolation queries | 0 |
| percolate.time | pti | No | Time spent percolating | 0s |
| percolate.total | pto | No | Total percolations | 0 |
| refresh.total | rto | No | Number of refreshes | 16 |
| refresh.time | rti | No | Time spent in refreshes | 91ms |
| script.compilations | scrcc | No | Total script compilations | 17 |
| script.cache_evictions | scrce | No | Total compiled scripts evicted from cache | 6 |
| search.fetch_current | sfc | No | Current fetch phase operations | 0 |
| search.fetch_time | sfti | No | Time spent in fetch phase | 37ms |
| search.fetch_total | sfto | No | Number of fetch operations | 7 |
| search.open_contexts | so | No | Open search contexts | 0 |
| search.query_current | sqc | No | Current query phase operations | 0 |
| search.query_time | sqti | No | Time spent in query phase | 43ms |
| search.query_total | sqto | No | Number of query operations | 9 |
| search.scroll_current | scc | No | Open scroll contexts | 2 |
| search.scroll_time | scti | No | Time scroll contexts held open | 2m |
| search.scroll_total | scto | No | Completed scroll contexts | 1 |
| segments.count | sc | No | Number of segments | 4 |
| segments.memory | sm | No | Memory used by segments | 1.4kb |
| segments.index_writer_memory | siwm | No | Memory used by index writer | 18mb |
| segments.index_writer_max_memory | siwmx | No | Maximum memory index writer may use before it must write buffered documents to a new segment | 32mb |
| segments.version_map_memory | svmm | No | Memory used by version map | 1.0kb |
89. cat pending tasks
pending_tasks provides the same information as the
/_cluster/pending_tasks API in a
convenient tabular format.
% curl 'localhost:9200/_cat/pending_tasks?v'
insertOrder timeInQueue priority source
1685 855ms HIGH update-mapping [foo][t]
1686 843ms HIGH update-mapping [foo][t]
1693 753ms HIGH refresh-mapping [foo][[t]]
1688 816ms HIGH update-mapping [foo][t]
1689 802ms HIGH update-mapping [foo][t]
1690 787ms HIGH update-mapping [foo][t]
1691 773ms HIGH update-mapping [foo][t]
90. cat plugins
The plugins command provides a view per node of running plugins. This information spans nodes.
% curl 'localhost:9200/_cat/plugins?v'
name component version type isolation url
Abraxas cloud-azure 2.2.0-SNAPSHOT j x
Abraxas lang-groovy 2.2.0 j x
Abraxas lang-javascript 2.2.0-SNAPSHOT j x
Abraxas marvel NA j/s x /_plugin/marvel/
Abraxas lang-python 2.2.0-SNAPSHOT j x
Abraxas inquisitor NA s /_plugin/inquisitor/
Abraxas kopf 0.5.2 s /_plugin/kopf/
Abraxas segmentspy NA s /_plugin/segmentspy/
We can tell quickly how many plugins per node we have and which versions.
91. cat recovery
The recovery command is a view of index shard recoveries, both on-going and previously
completed. It is a more compact view of the JSON recovery API.
A recovery event occurs anytime an index shard moves to a different node in the cluster. This can happen during a snapshot recovery, a change in replication level, node failure, or on node startup. This last type is called a local store recovery and is the normal way for shards to be loaded from disk when a node starts up.
As an example, here is what the recovery state of a cluster may look like when there are no shards in transit from one node to another:
> curl -XGET 'localhost:9200/_cat/recovery?v'
index shard time type stage source target files percent bytes percent
wiki 0 73 store done hostA hostA 36 100.0% 24982806 100.0%
wiki 1 245 store done hostA hostA 33 100.0% 24501912 100.0%
wiki 2 230 store done hostA hostA 36 100.0% 30267222 100.0%
In the above case, the source and target nodes are the same because the recovery type was store, i.e. they were read from local storage on node start.
Now let’s see what a live recovery looks like: by increasing the replica count of our index and bringing another node online to host the replicas, we can observe a live shard recovery.
> curl -XPUT 'localhost:9200/wiki/_settings' -d'{"number_of_replicas":1}'
{"acknowledged":true}
> curl -XGET 'localhost:9200/_cat/recovery?v'
index shard time type stage source target files percent bytes percent
wiki 0 1252 store done hostA hostA 4 100.0% 23638870 100.0%
wiki 0 1672 replica index hostA hostB 4 75.0% 23638870 48.8%
wiki 1 1698 replica index hostA hostB 4 75.0% 23348540 49.4%
wiki 1 4812 store done hostA hostA 33 100.0% 24501912 100.0%
wiki 2 1689 replica index hostA hostB 4 75.0% 28681851 40.2%
wiki 2 5317 store done hostA hostA 36 100.0% 30267222 100.0%
We can see in the above listing that our 3 initial shards are in various stages
of being replicated from one node to another. Notice that the recovery type is
shown as replica. The files and bytes copied are real-time measurements.
Finally, let’s see what a snapshot recovery looks like. Assuming I have previously made a backup of my index, I can restore it using the snapshot and restore API.
> curl -XPOST 'localhost:9200/_snapshot/imdb/snapshot_2/_restore'
{"acknowledged":true}
> curl -XGET 'localhost:9200/_cat/recovery?v'
index shard time type stage repository snapshot files percent bytes percent
imdb 0 1978 snapshot done imdb snap_1 79 8.0% 12086 9.0%
imdb 1 2790 snapshot index imdb snap_1 88 7.7% 11025 8.1%
imdb 2 2790 snapshot index imdb snap_1 85 0.0% 12072 0.0%
imdb 3 2796 snapshot index imdb snap_1 85 2.4% 12048 7.2%
imdb 4 819 snapshot init imdb snap_1 0 0.0% 0 0.0%
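Captured _cat/recovery output is also easy to post-process. As a sketch (using the column layout of the replica-recovery listing above; the sample lines are copied from that example), here is one way to compute the average bytes-recovered percentage across in-flight replica recoveries:

```shell
# Sample _cat/recovery rows:
# index shard time type stage source target files percent bytes percent
cat > /tmp/recovery.txt <<'EOF'
wiki 0 1252 store done hostA hostA 4 100.0% 23638870 100.0%
wiki 0 1672 replica index hostA hostB 4 75.0% 23638870 48.8%
wiki 1 1698 replica index hostA hostB 4 75.0% 23348540 49.4%
wiki 2 1689 replica index hostA hostB 4 75.0% 28681851 40.2%
EOF

# Average the bytes-percent column (11th field) over replica recoveries only.
awk '$4 == "replica" { gsub("%", "", $11); sum += $11; n++ }
     END { printf "avg replica bytes recovered: %.1f%%\n", sum / n }' /tmp/recovery.txt
```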
92. cat repositories
The repositories command shows the snapshot repositories registered in the cluster.
% curl 'localhost:9200/_cat/repositories?v'
id type
repo1 fs
repo2 s3
We can quickly see which repositories are registered and their type.
93. cat thread pool
The thread_pool command shows cluster wide thread pool statistics per node. By default the active, queue and rejected
statistics are returned for the bulk, index and search thread pools.
% curl 192.168.56.10:9200/_cat/thread_pool
host1 192.168.1.35 0 0 0 0 0 0 0 0 0
host2 192.168.1.36 0 0 0 0 0 0 0 0 0
The first two columns contain the host and ip of a node.
host ip
host1 192.168.1.35
host2 192.168.1.36
The next three columns show the active, queue, and rejected statistics for the bulk thread pool.
bulk.active bulk.queue bulk.rejected
0 0 0
The remaining columns show the active, queue, and rejected statistics of the index and search thread pools, respectively.
Statistics for other thread pools can be retrieved by using the h (header) parameter.
% curl 'localhost:9200/_cat/thread_pool?v&h=id,host,suggest.active,suggest.rejected,suggest.completed'
host suggest.active suggest.rejected suggest.completed
host1 0 0 0
host2 0 0 0
Here the host column and the active, rejected, and completed statistics for the suggest thread pool are displayed. The suggest thread pool won’t be displayed by default, so you always need to be specific about which statistics you want to display.
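Rejections are usually the statistic worth watching. A minimal sketch, assuming the default column layout shown above (host, ip, then active/queue/rejected for the bulk, index, and search pools), that flags any node with a non-zero rejected count:

```shell
# Default _cat/thread_pool layout: host ip bulk.active bulk.queue bulk.rejected
# index.active index.queue index.rejected search.active search.queue search.rejected
cat > /tmp/pools.txt <<'EOF'
host1 192.168.1.35 0 0 0 0 0 0 0 0 0
host2 192.168.1.36 0 0 12 0 0 0 0 0 0
EOF

# Print any host whose bulk, index, or search pool has rejections.
awk '$5 > 0 || $8 > 0 || $11 > 0 { print $1 }' /tmp/pools.txt
```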
Available Thread Pools
Currently available thread pools:
| Thread Pool | Alias | Description |
|---|---|---|
| bulk | b | Thread pool used for bulk operations |
| flush | f | Thread pool used for flush operations |
| generic | ge | Thread pool used for generic operations (e.g. background node discovery) |
| get | g | Thread pool used for get operations |
| index | i | Thread pool used for index/delete operations |
| management | ma | Thread pool used for management of Elasticsearch (e.g. cluster management) |
| force_merge | fm | Thread pool used for force merge operations |
| percolate | p | Thread pool used for percolator operations |
| refresh | r | Thread pool used for refresh operations |
| search | s | Thread pool used for search/count operations |
| snapshot | sn | Thread pool used for snapshot operations |
| suggest | su | Thread pool used for suggester operations |
| warmer | w | Thread pool used for index warm-up operations |
The thread pool name (or alias) must be combined with a thread pool field below to retrieve the requested information.
Thread Pool Fields
For each thread pool, you can load details about it by using the field names
in the table below, either using the full field name (e.g. bulk.active) or
its alias (e.g. sa is equivalent to search.active).
| Field Name | Alias | Description |
|---|---|---|
| type | t | The current (*) type of thread pool (fixed or cached) |
| active | a | The number of active threads in the current thread pool |
| size | s | The number of threads in the current thread pool |
| queue | q | The number of tasks in the queue for the current thread pool |
| queue_size | qs | The maximum number of tasks in the queue for the current thread pool |
| rejected | r | The number of rejected threads in the current thread pool |
| largest | l | The highest number of active threads in the current thread pool |
| completed | c | The number of completed threads in the current thread pool |
| min | mi | The configured minimum number of active threads allowed in the current thread pool |
| max | ma | The configured maximum number of active threads allowed in the current thread pool |
| keep_alive | k | The configured keep alive time for threads |
Other Fields
In addition to details about each thread pool, it is also convenient to get an
understanding of where those thread pools reside. As such, you can request
other details like the ip of the responding node(s).
| Field Name | Alias | Description |
|---|---|---|
| id | nodeId | The unique node ID |
| pid | p | The process ID of the running node |
| host | h | The hostname for the current node |
| ip | i | The IP address for the current node |
| port | po | The bound transport port for the current node |
94. cat shards
The shards command is the detailed view of what nodes contain which
shards. It will tell you if it’s a primary or replica, the number of
docs, the bytes it takes on disk, and the node where it’s located.
Here we see a single index, with three primary shards and no replicas:
% curl 192.168.56.20:9200/_cat/shards
wiki1 0 p STARTED 3014 31.1mb 192.168.56.10 Stiletto
wiki1 1 p STARTED 3013 29.6mb 192.168.56.30 Frankie Raye
wiki1 2 p STARTED 3973 38.1mb 192.168.56.20 Commander Kraken
Index pattern
If you have many shards, you may wish to limit which indices show up
in the output. You can always do this with grep, but you can save
some bandwidth by supplying an index pattern to the end.
% curl 192.168.56.20:9200/_cat/shards/wiki2
wiki2 0 p STARTED 197 3.2mb 192.168.56.10 Stiletto
wiki2 1 p STARTED 205 5.9mb 192.168.56.30 Frankie Raye
wiki2 2 p STARTED 275 7.8mb 192.168.56.20 Commander Kraken
Relocation
Let’s say you’ve checked your health and you see two relocating shards. Where are they from and where are they going?
% curl 192.168.56.10:9200/_cat/health
1384315316 20:01:56 foo green 3 3 12 6 2 0 0
% curl 192.168.56.10:9200/_cat/shards | fgrep RELO
wiki1 0 r RELOCATING 3014 31.1mb 192.168.56.20 Commander Kraken -> 192.168.56.30 Frankie Raye
wiki1 1 r RELOCATING 3013 29.6mb 192.168.56.10 Stiletto -> 192.168.56.30 Frankie Raye
Shard states
Before a shard can be used, it goes through an INITIALIZING state.
The shards command can show you which ones.
% curl -XPUT 192.168.56.20:9200/_settings -d'{"number_of_replicas":1}'
{"acknowledged":true}
% curl 192.168.56.20:9200/_cat/shards
wiki1 0 p STARTED 3014 31.1mb 192.168.56.10 Stiletto
wiki1 0 r INITIALIZING 0 14.3mb 192.168.56.30 Frankie Raye
wiki1 1 p STARTED 3013 29.6mb 192.168.56.30 Frankie Raye
wiki1 1 r INITIALIZING 0 13.1mb 192.168.56.20 Commander Kraken
wiki1 2 r INITIALIZING 0 14mb 192.168.56.10 Stiletto
wiki1 2 p STARTED 3973 38.1mb 192.168.56.20 Commander Kraken
If a shard cannot be assigned, for example because you’ve overallocated
the number of replicas for the number of nodes in the cluster, the shard
will remain UNASSIGNED with the reason code ALLOCATION_FAILED.
% curl -XPUT 192.168.56.20:9200/_settings -d'{"number_of_replicas":3}'
% curl 192.168.56.20:9200/_cat/health
1384316325 20:18:45 foo yellow 3 3 9 3 0 0 3
% curl 192.168.56.20:9200/_cat/shards
wiki1 0 p STARTED 3014 31.1mb 192.168.56.10 Stiletto
wiki1 0 r STARTED 3014 31.1mb 192.168.56.30 Frankie Raye
wiki1 0 r STARTED 3014 31.1mb 192.168.56.20 Commander Kraken
wiki1 0 r UNASSIGNED ALLOCATION_FAILED
wiki1 1 r STARTED 3013 29.6mb 192.168.56.10 Stiletto
wiki1 1 p STARTED 3013 29.6mb 192.168.56.30 Frankie Raye
wiki1 1 r STARTED 3013 29.6mb 192.168.56.20 Commander Kraken
wiki1 1 r UNASSIGNED ALLOCATION_FAILED
wiki1 2 r STARTED 3973 38.1mb 192.168.56.10 Stiletto
wiki1 2 r STARTED 3973 38.1mb 192.168.56.30 Frankie Raye
wiki1 2 p STARTED 3973 38.1mb 192.168.56.20 Commander Kraken
wiki1 2 r UNASSIGNED ALLOCATION_FAILED
Reasons for unassigned shard
These are the possible reasons for a shard to be in an unassigned state:

| Reason | Description |
|---|---|
| INDEX_CREATED | Unassigned as a result of an API creation of an index. |
| CLUSTER_RECOVERED | Unassigned as a result of a full cluster recovery. |
| INDEX_REOPENED | Unassigned as a result of opening a closed index. |
| DANGLING_INDEX_IMPORTED | Unassigned as a result of importing a dangling index. |
| NEW_INDEX_RESTORED | Unassigned as a result of restoring into a new index. |
| EXISTING_INDEX_RESTORED | Unassigned as a result of restoring into a closed index. |
| REPLICA_ADDED | Unassigned as a result of explicit addition of a replica. |
| ALLOCATION_FAILED | Unassigned as a result of a failed allocation of the shard. |
| NODE_LEFT | Unassigned as a result of the node hosting it leaving the cluster. |
| REROUTE_CANCELLED | Unassigned as a result of an explicit cancel reroute command. |
| REINITIALIZED | When a shard moves from started back to initializing, for example, with shadow replicas. |
| REALLOCATED_REPLICA | A better replica location is identified and causes the existing replica allocation to be cancelled. |
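Because the reason code appears as an extra column on UNASSIGNED rows, tallying unassigned shards by reason is a one-liner. A sketch over lines copied from the over-allocated listing above:

```shell
# _cat/shards rows; UNASSIGNED rows carry the reason code in the fifth column.
cat > /tmp/shards.txt <<'EOF'
wiki1 0 p STARTED 3014 31.1mb 192.168.56.10 Stiletto
wiki1 0 r UNASSIGNED ALLOCATION_FAILED
wiki1 1 r UNASSIGNED ALLOCATION_FAILED
wiki1 2 r UNASSIGNED ALLOCATION_FAILED
EOF

# Count unassigned shards per reason code.
awk '$4 == "UNASSIGNED" { count[$5]++ } END { for (r in count) print r, count[r] }' /tmp/shards.txt
```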
95. cat segments
The segments command provides low level information about the segments
in the shards of an index. It provides information similar to the
_segments endpoint.
% curl 'http://localhost:9200/_cat/segments?v'
index shard prirep ip segment generation docs.count [...]
test 4 p 192.168.2.105 _0 0 1
test1 2 p 192.168.2.105 _0 0 1
test1 3 p 192.168.2.105 _2 2 1
[...] docs.deleted size size.memory committed searchable version compound
0 2.9kb 7818 false true 4.10.2 true
0 2.9kb 7818 false true 4.10.2 true
0 2.9kb 7818 false true 4.10.2 true
The output shows information about index names and shard numbers in the first two columns.
If you only want to get information about segments in one particular index,
you can add the index name in the URL, for example /_cat/segments/test.
Several indices can also be queried, as in /_cat/segments/test,test1.
The following columns provide additional monitoring information:
- prirep
-
Whether this segment belongs to a primary or replica shard.
- ip
-
The IP address of the segment’s shard.
- segment
-
A segment name, derived from the segment generation. The name is internally used to generate the file names in the directory of the shard this segment belongs to.
- generation
-
The generation number is incremented with each segment that is written. The name of the segment is derived from this generation number.
- docs.count
-
The number of non-deleted documents that are stored in this segment.
- docs.deleted
-
The number of deleted documents that are stored in this segment. It is perfectly fine if this number is greater than 0, space is going to be reclaimed when this segment gets merged.
- size
-
The amount of disk space that this segment uses.
- size.memory
-
Segments store some data into memory in order to be searchable efficiently. This column shows the number of bytes in memory that are used.
- committed
-
Whether the segment has been synced to disk. Segments that are committed would survive a hard reboot. No need to worry in case of false; the data from uncommitted segments is also stored in the transaction log so that Elasticsearch is able to replay changes on the next start.
- searchable
-
True if the segment is searchable. A value of false would most likely mean that the segment has been written to disk but no refresh occurred since then to make it searchable.
- version
-
The version of Lucene that has been used to write this segment.
- compound
-
Whether the segment is stored in a compound file. When true, this means that Lucene merged all files from the segment in a single one in order to save file descriptors.
96. cat snapshots
The snapshots command shows all snapshots that belong to a specific repository.
To find a list of available repositories to query, the command /_cat/repositories can be used.
Querying the snapshots of a repository named repo1 then looks as follows.
% curl 'localhost:9200/_cat/snapshots/repo1?v'
id status start_epoch start_time end_epoch end_time duration indices successful_shards failed_shards total_shards
snap1 FAILED 1445616705 18:11:45 1445616978 18:16:18 4.6m 1 4 1 5
snap2 SUCCESS 1445634298 23:04:58 1445634672 23:11:12 6.2m 2 10 0 10
Each snapshot contains information about when it was started and stopped.
Start and stop timestamps are available in two formats.
The HH:MM:SS output is simply for quick human consumption.
The epoch time retains more information, including date, and is machine sortable if the snapshot process spans days.
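Since the epoch columns are machine-sortable, captured snapshot listings can be ordered with standard tools. A small sketch sorting snapshots by start time (sample rows abridged from the listing above):

```shell
# Captured _cat/snapshots rows: id status start_epoch start_time ...
cat > /tmp/snaps.txt <<'EOF'
snap2 SUCCESS 1445634298 23:04:58
snap1 FAILED 1445616705 18:11:45
EOF

# Numeric sort on the start_epoch column puts the oldest snapshot first.
sort -nk3 /tmp/snaps.txt | head -1 | awk '{ print $1 }'
```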
Cluster APIs
Node specification
Most cluster level APIs allow you to specify which nodes to execute on (for
example, getting the node stats for a node). Nodes can be identified in
the APIs either using their internal node id, the node name, address,
custom attributes, or just the _local node receiving the request. For
example, here are some sample executions of nodes info:
# Local
curl localhost:9200/_nodes/_local
# Address
curl localhost:9200/_nodes/10.0.0.3,10.0.0.4
curl localhost:9200/_nodes/10.0.0.*
# Names
curl localhost:9200/_nodes/node_name_goes_here
curl localhost:9200/_nodes/node_name_goes_*
# Attributes (set something like node.rack: 2 in the config)
curl localhost:9200/_nodes/rack:2
curl localhost:9200/_nodes/ra*:2
curl localhost:9200/_nodes/ra*:2*
97. Cluster Health
The cluster health API allows you to get a very simple status on the health of the cluster.
$ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
"cluster_name" : "testcluster",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 5,
"active_shards" : 10,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 100
}
The API can also be executed against one or more indices to get just the specified indices health:
$ curl -XGET 'http://localhost:9200/_cluster/health/test1,test2'
The cluster health status is: green, yellow or red. On the shard
level, a red status indicates that the specific shard is not allocated
in the cluster, yellow means that the primary shard is allocated but
replicas are not, and green means that all shards are allocated. The
index level status is controlled by the worst shard status. The cluster
status is controlled by the worst index status.
One of the main benefits of the API is the ability to wait until the
cluster reaches a certain high water-mark health level. For example, the
following will wait for 50 seconds for the cluster to reach the yellow
level (if it reaches the green or yellow status before 50 seconds elapse,
it will return at that point):
$ curl -XGET 'http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=50s'
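A common use is gating a deployment script on the returned status. A minimal sketch that parses a captured health response with sed (the response file is assumed; in practice you would pair this with wait_for_status as above):

```shell
# A captured /_cluster/health response (compacted); only "status" matters here.
echo '{"cluster_name":"testcluster","status":"yellow","timed_out":false}' > /tmp/health.json

# Extract the status field and refuse to proceed when the cluster is red.
status=$(sed -n 's/.*"status":"\([a-z]*\)".*/\1/p' /tmp/health.json)
if [ "$status" = "red" ]; then
  echo "cluster is red, aborting"
else
  echo "proceeding with status $status"
fi
```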
Request Parameters
The cluster health API accepts the following request parameters:
- level: Can be one of cluster, indices or shards. Controls the level of detail of the health information returned. Defaults to cluster.
- wait_for_status: One of green, yellow or red. Will wait (until the timeout provided) until the status of the cluster changes to the one provided or better, i.e. green > yellow > red. By default, will not wait for any status.
- wait_for_relocating_shards: A number controlling how many relocating shards to wait for. Usually 0 to indicate waiting until all relocations have finished. Defaults to not waiting.
- wait_for_active_shards: A number controlling how many active shards to wait for. Defaults to not waiting.
- wait_for_nodes: The request waits until the specified number N of nodes is available. It also accepts >=N, <=N, >N and <N. Alternatively, it is possible to use ge(N), le(N), gt(N) and lt(N) notation.
- timeout: A time-based parameter controlling how long to wait if one of the wait_for_XXX parameters is provided. Defaults to 30s.
- local: If true, returns the local node information and does not provide the state from the master node. Default: false.
The following is an example of getting the cluster health at the
shards level:
$ curl -XGET 'http://localhost:9200/_cluster/health/twitter?level=shards'
98. Cluster State
The cluster state API allows to get a comprehensive state information of the whole cluster.
$ curl -XGET 'http://localhost:9200/_cluster/state'
By default, the cluster state request is routed to the master node, to
ensure that the latest cluster state is returned.
For debugging purposes, you can retrieve the cluster state local to a
particular node by adding local=true to the query string.
Response Filters
As the cluster state can grow (depending on the number of shards and indices, your mappings, and templates), it is possible to filter the cluster state response by specifying the parts in the URL.
$ curl -XGET 'http://localhost:9200/_cluster/state/{metrics}/{indices}'
metrics can be a comma-separated list of:

- version: Shows the cluster state version.
- master_node: Shows the elected master_node part of the response.
- nodes: Shows the nodes part of the response.
- routing_table: Shows the routing_table part of the response. If you supply a comma separated list of indices, the returned output will only contain the indices listed.
- metadata: Shows the metadata part of the response. If you supply a comma separated list of indices, the returned output will only contain the indices listed.
- blocks: Shows the blocks part of the response.
A couple of example calls:
# return only metadata and routing_table data for specified indices
$ curl -XGET 'http://localhost:9200/_cluster/state/metadata,routing_table/foo,bar'
# return everything for these two indices
$ curl -XGET 'http://localhost:9200/_cluster/state/_all/foo,bar'
# Return only blocks data
$ curl -XGET 'http://localhost:9200/_cluster/state/blocks'
99. Cluster Stats
The Cluster Stats API allows you to retrieve statistics from a cluster wide perspective. The API returns basic index metrics (shard numbers, store size, memory usage) and information about the current nodes that form the cluster (number, roles, os, jvm versions, memory usage, cpu and installed plugins).
curl -XGET 'http://localhost:9200/_cluster/stats?human&pretty'
Will return, for example:
{
"timestamp": 1439326129256,
"cluster_name": "elasticsearch",
"status": "green",
"indices": {
"count": 3,
"shards": {
"total": 35,
"primaries": 15,
"replication": 1.333333333333333,
"index": {
"shards": {
"min": 10,
"max": 15,
"avg": 11.66666666666666
},
"primaries": {
"min": 5,
"max": 5,
"avg": 5
},
"replication": {
"min": 1,
"max": 2,
"avg": 1.3333333333333333
}
}
},
"docs": {
"count": 2,
"deleted": 0
},
"store": {
"size": "5.6kb",
"size_in_bytes": 5770,
"throttle_time": "0s",
"throttle_time_in_millis": 0
},
"fielddata": {
"memory_size": "0b",
"memory_size_in_bytes": 0,
"evictions": 0
},
"query_cache": {
"memory_size": "0b",
"memory_size_in_bytes": 0,
"evictions": 0
},
"completion": {
"size": "0b",
"size_in_bytes": 0
},
"segments": {
"count": 2,
"memory": "6.4kb",
"memory_in_bytes": 6596,
"index_writer_memory": "0b",
"index_writer_memory_in_bytes": 0,
"index_writer_max_memory": "275.7mb",
"index_writer_max_memory_in_bytes": 289194639,
"version_map_memory": "0b",
"version_map_memory_in_bytes": 0,
"fixed_bit_set": "0b",
"fixed_bit_set_memory_in_bytes": 0
},
"percolate": {
"total": 0,
"get_time": "0s",
"time_in_millis": 0,
"current": 0,
"memory_size_in_bytes": -1,
"memory_size": "-1b",
"queries": 0
}
},
"nodes": {
"count": {
"total": 2,
"master_only": 0,
"data_only": 0,
"master_data": 2,
"client": 0
},
"versions": [
"2.3.0"
],
"os": {
"available_processors": 4,
"mem": {
"total": "8gb",
"total_in_bytes": 8589934592
},
"names": [
{
"name": "Mac OS X",
"count": 1
}
],
"cpu": [
{
"vendor": "Intel",
"model": "MacBookAir5,2",
"mhz": 2000,
"total_cores": 4,
"total_sockets": 4,
"cores_per_socket": 16,
"cache_size": "256b",
"cache_size_in_bytes": 256,
"count": 1
}
]
},
"process": {
"cpu": {
"percent": 3
},
"open_file_descriptors": {
"min": 200,
"max": 346,
"avg": 273
}
},
"jvm": {
"max_uptime": "24s",
"max_uptime_in_millis": 24054,
"versions": [
{
"version": "1.6.0_45",
"vm_name": "Java HotSpot(TM) 64-Bit Server VM",
"vm_version": "20.45-b01-451",
"vm_vendor": "Apple Inc.",
"count": 2
}
],
"mem": {
"heap_used": "38.3mb",
"heap_used_in_bytes": 40237120,
"heap_max": "1.9gb",
"heap_max_in_bytes": 2130051072
},
"threads": 89
},
"fs":
{
"total": "232.9gb",
"total_in_bytes": 250140434432,
"free": "31.3gb",
"free_in_bytes": 33705881600,
"available": "31.1gb",
"available_in_bytes": 33443737600,
"disk_reads": 21202753,
"disk_writes": 27028840,
"disk_io_op": 48231593,
"disk_read_size": "528gb",
"disk_read_size_in_bytes": 566980806656,
"disk_write_size": "617.9gb",
"disk_write_size_in_bytes": 663525366784,
"disk_io_size": "1145.9gb",
"disk_io_size_in_bytes": 1230506173440
},
"plugins": [
// all plugins installed on nodes
{
"name": "inquisitor",
"description": "",
"url": "/_plugin/inquisitor/",
"jvm": false,
"site": true
}
]
}
}
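The _in_bytes fields make the response easy to post-process without parsing human-readable units. For instance, the heap figures in the sample response above (heap_used_in_bytes 40237120 against heap_max_in_bytes 2130051072) translate to roughly 2% heap usage cluster-wide:

```shell
# Cluster-wide heap usage, computed from the byte-valued stats fields above.
awk 'BEGIN {
  used = 40237120       # jvm.mem.heap_used_in_bytes from the sample response
  max  = 2130051072     # jvm.mem.heap_max_in_bytes
  printf "cluster heap used: %.1f%%\n", 100 * used / max
}'
```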
100. Pending cluster tasks
The pending cluster tasks API returns a list of any cluster-level changes (e.g. create index, update mapping, allocate or fail shard) which have not yet been executed.
Note: This API returns a list of any pending updates to the cluster state. These are distinct from the tasks reported by the Task Management API, which include periodic tasks and tasks initiated by the user, such as node stats, search queries, or create index requests. However, if a user-initiated task such as a create index command causes a cluster state update, the activity of this task might be reported by both the task API and the pending cluster tasks API.
$ curl -XGET 'http://localhost:9200/_cluster/pending_tasks'
Usually this will return an empty list as cluster-level changes are usually fast. However if there are tasks queued up, the output will look something like this:
{
"tasks": [
{
"insert_order": 101,
"priority": "URGENT",
"source": "create-index [foo_9], cause [api]",
"time_in_queue_millis": 86,
"time_in_queue": "86ms"
},
{
"insert_order": 46,
"priority": "HIGH",
"source": "shard-started ([foo_2][1], node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after recovery from shard_store]",
"time_in_queue_millis": 842,
"time_in_queue": "842ms"
},
{
"insert_order": 45,
"priority": "HIGH",
"source": "shard-started ([foo_2][0], node[tMTocMvQQgGCkj7QDHl3OA], [P], s[INITIALIZING]), reason [after recovery from shard_store]",
"time_in_queue_millis": 858,
"time_in_queue": "858ms"
}
]
}
101. Cluster Reroute
The reroute command allows you to explicitly execute a cluster reroute allocation command, including specific commands. For example, a shard can be moved from one node to another explicitly, an allocation can be cancelled, or an unassigned shard can be explicitly allocated on a specific node.
Here is a short example of a simple reroute API call:
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
"commands" : [ {
"move" :
{
"index" : "test", "shard" : 0,
"from_node" : "node1", "to_node" : "node2"
}
},
{
"allocate" : {
"index" : "test", "shard" : 1, "node" : "node3"
}
}
]
}'
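Since the request body is plain JSON, it can also be assembled programmatically. The following Python sketch builds the same two-command body; the helper function names are our own, not part of any client library:

```python
import json

def move(index, shard, from_node, to_node):
    # "move" command: relocate a started shard between nodes.
    return {"move": {"index": index, "shard": shard,
                     "from_node": from_node, "to_node": to_node}}

def allocate(index, shard, node, allow_primary=False):
    # "allocate" command: place an unassigned shard on a node.
    cmd = {"index": index, "shard": shard, "node": node}
    if allow_primary:
        cmd["allow_primary"] = True
    return {"allocate": cmd}

body = {"commands": [move("test", 0, "node1", "node2"),
                     allocate("test", 1, "node3")]}
print(json.dumps(body, indent=2))
```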
An important aspect to remember is that once an allocation
occurs, the cluster will aim to re-balance its state back to an even
state. For example, if the allocation includes moving a shard from
node1 to node2, in an even state, then another shard will be moved
from node2 to node1 to even things out.
The cluster can be set to disable allocations, in which case only the explicitly requested allocations will be performed. Once all commands have been applied, the cluster will then aim to re-balance its state.
Another option is to run the commands in dry_run (as a URI flag, or in
the request body). This will cause the commands to apply to the current
cluster state, and return the resulting cluster after the commands (and
re-balancing) has been applied.
If the explain parameter is specified, a detailed explanation of why the
commands could or could not be executed is returned.
The commands supported are:
move-
Move a started shard from one node to another node. Accepts index and shard for index name and shard number, from_node for the node to move the shard from, and to_node for the node to move the shard to.
cancel-
Cancel allocation of a shard (or recovery). Accepts index and shard for index name and shard number, and node for the node to cancel the shard allocation on. It also accepts an allow_primary flag to explicitly specify that it is allowed to cancel allocation for a primary shard. This can be used to force resynchronization of existing replicas from the primary shard by cancelling them and allowing them to be reinitialized through the standard reallocation process.
allocate-
Allocate an unassigned shard to a node. Accepts index and shard for index name and shard number, and node to allocate the shard to. It also accepts an allow_primary flag to explicitly specify that it is allowed to explicitly allocate a primary shard (which might result in data loss).
Warning: The allow_primary parameter will force a new empty primary shard
to be allocated without any data. If a node which has a copy of the original
shard (including data) rejoins the cluster later on, that data will be
deleted: the old shard copy will be replaced by the new live shard copy.
102. Cluster Update Settings
Allows cluster-wide settings to be updated. Updated settings can either be persistent (applied across restarts) or transient (will not survive a full cluster restart). Here is an example:
curl -XPUT localhost:9200/_cluster/settings -d '{
"persistent" : {
"discovery.zen.minimum_master_nodes" : 2
}
}'
Or:
curl -XPUT localhost:9200/_cluster/settings -d '{
"transient" : {
"discovery.zen.minimum_master_nodes" : 2
}
}'
The cluster responds with the settings updated. So the response for the last example will be:
{
"persistent" : {},
"transient" : {
"discovery.zen.minimum_master_nodes" : "2"
}
}
Cluster wide settings can be returned using:
curl -XGET localhost:9200/_cluster/settings
Precedence of settings
Transient cluster settings take precedence over persistent cluster settings,
which take precedence over settings configured in the elasticsearch.yml
config file.
For this reason it is preferable to use the elasticsearch.yml file only
for local configuration, and to set all cluster-wide settings with the
settings API.
A list of dynamically updatable settings can be found in the Modules documentation.
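The precedence rules described above can be sketched as a simple lookup chain; the setting names below are only illustrative:

```python
def effective_setting(name, transient, persistent, yml):
    """Resolve a cluster setting: transient wins over persistent,
    which wins over elasticsearch.yml."""
    for scope in (transient, persistent, yml):
        if name in scope:
            return scope[name]
    return None

transient = {"discovery.zen.minimum_master_nodes": "3"}
persistent = {"discovery.zen.minimum_master_nodes": "2",
              "cluster.routing.allocation.enable": "all"}
yml = {"cluster.routing.allocation.enable": "none"}

# Transient beats persistent; persistent beats elasticsearch.yml.
print(effective_setting("discovery.zen.minimum_master_nodes",
                        transient, persistent, yml))  # 3
print(effective_setting("cluster.routing.allocation.enable",
                        transient, persistent, yml))  # all
```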
103. Nodes Stats
Nodes statistics
The cluster nodes stats API allows you to retrieve statistics for one or more (or all) of the cluster nodes.
curl -XGET 'http://localhost:9200/_nodes/stats'
curl -XGET 'http://localhost:9200/_nodes/nodeId1,nodeId2/stats'
The first command retrieves stats of all the nodes in the cluster. The
second command selectively retrieves nodes stats of only nodeId1 and
nodeId2. All the nodes selective options are explained
here.
By default, all stats are returned. You can limit this by combining any
of indices, os, process, jvm, transport, http,
fs, breaker and thread_pool. For example:
indices
|
Indices stats about size, document count, indexing and deletion times, search times, field cache size, merges and flushes |
fs
|
File system information, data path, free disk space, read/write stats (see FS information) |
http
|
HTTP connection information |
jvm
|
JVM stats, memory pool information, garbage collection, buffer pools, number of loaded/unloaded classes |
os
|
Operating system stats, load average, mem, swap (see OS statistics) |
process
|
Process statistics, memory consumption, cpu usage, open file descriptors (see Process statistics) |
thread_pool
|
Statistics about each thread pool, including current size, queue and rejected tasks |
transport
|
Transport statistics about sent and received bytes in cluster communication |
breaker
|
Statistics about the field data circuit breaker |
# return just indices and os
curl -XGET 'http://localhost:9200/_nodes/stats/indices,os'
# return just os and process
curl -XGET 'http://localhost:9200/_nodes/stats/os,process'
# specific type endpoint
curl -XGET 'http://localhost:9200/_nodes/stats/process'
curl -XGET 'http://localhost:9200/_nodes/10.0.0.1/stats/process'
The all flag can be set to return all the stats.
FS information
The fs flag can be set to retrieve
information that concerns the file system:
fs.timestamp-
Last time the file stores statistics have been refreshed
fs.total.total_in_bytes-
Total size (in bytes) of all file stores
fs.total.free_in_bytes-
Total number of unallocated bytes in all file stores
fs.total.available_in_bytes-
Total number of bytes available to this Java virtual machine on all file stores
fs.data-
List of all file stores
fs.data.path-
Path to the file store
fs.data.mount-
Mount point of the file store (ex: /dev/sda2)
fs.data.type-
Type of the file store (ex: ext4)
fs.data.total_in_bytes-
Total size (in bytes) of the file store
fs.data.free_in_bytes-
Total number of unallocated bytes in the file store
fs.data.available_in_bytes-
Total number of bytes available to this Java virtual machine on this file store
fs.data.spins(Linux only)-
Indicates if the file store is backed by spinning storage.
null means we could not determine it, true means the device possibly spins and false means it does not (ex: solid-state disks).
Operating System statistics
The os flag can be set to retrieve statistics that concern
the operating system:
os.timestamp-
Last time the operating system statistics have been refreshed
os.percent-
Recent CPU usage for the whole system, or -1 if not supported
os.load_average-
System load average for the last minute, or -1 if not supported
os.mem.total_in_bytes-
Total amount of physical memory in bytes
os.mem.free_in_bytes-
Amount of free physical memory in bytes
os.mem.free_percent-
Percentage of free memory
os.mem.used_in_bytes-
Amount of used physical memory in bytes
os.mem.used_percent-
Percentage of used memory
os.swap.total_in_bytes-
Total amount of swap space in bytes
os.swap.free_in_bytes-
Amount of free swap space in bytes
os.swap.used_in_bytes-
Amount of used swap space in bytes
Process statistics
The process flag can be set to retrieve statistics that concern
the current running process:
process.timestamp-
Last time the process statistics have been refreshed
process.open_file_descriptors-
Number of opened file descriptors associated with the current process, or -1 if not supported
process.max_file_descriptors-
Maximum number of file descriptors allowed on the system, or -1 if not supported
process.cpu.percent-
CPU usage in percent, or -1 if not known at the time the stats are computed
process.cpu.total_in_millis-
CPU time (in milliseconds) used by the process on which the Java virtual machine is running, or -1 if not supported
process.mem.total_virtual_in_bytes-
Size in bytes of virtual memory that is guaranteed to be available to the running process
Field data statistics
You can get information about field data memory usage on node level or on index level.
# Node Stats
curl -XGET 'http://localhost:9200/_nodes/stats/indices/?fields=field1,field2&pretty'
# Indices Stat
curl -XGET 'http://localhost:9200/_stats/fielddata/?fields=field1,field2&pretty'
# You can use wildcards for field names
curl -XGET 'http://localhost:9200/_stats/fielddata/?fields=field*&pretty'
curl -XGET 'http://localhost:9200/_nodes/stats/indices/?fields=field*&pretty'
Search groups
You can get statistics about search groups for searches executed on this node.
# All groups with all stats
curl -XGET 'http://localhost:9200/_nodes/stats?pretty&groups=_all'
# Some groups from just the indices stats
curl -XGET 'http://localhost:9200/_nodes/stats/indices?pretty&groups=foo,bar'
104. Nodes Info
The cluster nodes info API allows you to retrieve information about one or more (or all) of the cluster nodes.
curl -XGET 'http://localhost:9200/_nodes'
curl -XGET 'http://localhost:9200/_nodes/nodeId1,nodeId2'
The first command retrieves information of all the nodes in the cluster.
The second command selectively retrieves nodes information of only
nodeId1 and nodeId2. All the nodes selective options are explained
here.
By default, it just returns all attributes and core settings for a node.
It also allows you to get information on only settings, os, process, jvm,
thread_pool, transport, http and plugins:
curl -XGET 'http://localhost:9200/_nodes/process'
curl -XGET 'http://localhost:9200/_nodes/_all/process'
curl -XGET 'http://localhost:9200/_nodes/nodeId1,nodeId2/jvm,process'
# same as above
curl -XGET 'http://localhost:9200/_nodes/nodeId1,nodeId2/info/jvm,process'
curl -XGET 'http://localhost:9200/_nodes/nodeId1,nodeId2/_all'
The _all flag can be set to return all the information - or you can simply omit it.
Operating System information
The os flag can be set to retrieve information that concerns
the operating system:
os.refresh_interval_in_millis-
Refresh interval for the OS statistics
os.name-
Name of the operating system (ex: Linux, Windows, Mac OS X)
os.arch-
Name of the JVM architecture (ex: amd64, x86)
os.version-
Version of the operating system
os.available_processors-
Number of processors available to the Java virtual machine
os.allocated_processors-
The number of processors actually used to calculate thread pool size. This number can be set with the
processors setting of a node and defaults to the number of processors reported by the OS. In both cases this number will never be larger than 32.
Process information
The process flag can be set to retrieve information that concerns
the current running process:
process.refresh_interval_in_millis-
Refresh interval for the process statistics
process.id-
Process identifier (PID)
process.mlockall-
Indicates if the process address space has been successfully locked in memory
Plugins information
plugins - if set, the result will contain details about the loaded
plugins per node:
-
name: plugin name
-
description: plugin description if any
-
site: true if the plugin is a site plugin
-
jvm: true if the plugin is a plugin running in the JVM
-
url: URL if the plugin is a site plugin
The result will look similar to:
{
"cluster_name" : "test-cluster-MacBook-Air-de-David.local",
"nodes" : {
"hJLXmY_NTrCytiIMbX4_1g" : {
"name" : "node4",
"transport_address" : "inet[/172.18.58.139:9303]",
"hostname" : "MacBook-Air-de-David.local",
"version" : "0.90.0.Beta2-SNAPSHOT",
"http_address" : "inet[/172.18.58.139:9203]",
"plugins" : [ {
"name" : "test-plugin",
"description" : "test-plugin description",
"site" : true,
"jvm" : false
}, {
"name" : "test-no-version-plugin",
"description" : "test-no-version-plugin description",
"site" : true,
"jvm" : false
}, {
"name" : "dummy",
"description" : "No description found for dummy.",
"url" : "/_plugin/dummy/",
"site" : false,
"jvm" : true
} ]
}
}
}
105. Task Management API
Warning: The Task Management API is new and should still be considered experimental. The API may change in ways that are not backwards compatible.
Current Tasks Information
The task management API allows you to retrieve information about the tasks currently executing on one or more nodes in the cluster.
GET /_tasks
GET /_tasks?nodes=nodeId1,nodeId2
GET /_tasks?nodes=nodeId1,nodeId2&actions=cluster:*
The first command retrieves all tasks currently running on all nodes in the cluster. The second retrieves all tasks running on nodes nodeId1 and nodeId2 (see Node specification for more info about how to select individual nodes). The third retrieves all cluster-related tasks running on nodes nodeId1 and nodeId2.
The result will look similar to the following:
{
"nodes" : {
"oTUltX4IQMOUUVeiohTt8A" : {
"name" : "Tamara Rahn",
"transport_address" : "127.0.0.1:9300",
"host" : "127.0.0.1",
"ip" : "127.0.0.1:9300",
"tasks" : {
"oTUltX4IQMOUUVeiohTt8A:124" : {
"node" : "oTUltX4IQMOUUVeiohTt8A",
"id" : 124,
"type" : "direct",
"action" : "cluster:monitor/tasks/lists[n]",
"start_time_in_millis" : 1458585884904,
"running_time_in_nanos" : 47402,
"parent_task_id" : "oTUltX4IQMOUUVeiohTt8A:123"
},
"oTUltX4IQMOUUVeiohTt8A:123" : {
"node" : "oTUltX4IQMOUUVeiohTt8A",
"id" : 123,
"type" : "transport",
"action" : "cluster:monitor/tasks/lists",
"start_time_in_millis" : 1458585884904,
"running_time_in_nanos" : 236042
}
}
}
}
}
It is also possible to retrieve information for a particular task, or for all children of a particular task, using the following two commands:
GET /_tasks/taskId1
GET /_tasks?parent_task_id=parentTaskId1
The task API can also be used to wait for completion of a particular task. The following call will
block for 10 seconds or until the task with id oTUltX4IQMOUUVeiohTt8A:12345 is completed.
GET /_tasks/oTUltX4IQMOUUVeiohTt8A:12345?wait_for_completion=true&timeout=10s
Task Cancellation
If a long-running task supports cancellation, it can be cancelled by the following command:
POST /_tasks/taskId1/_cancel
The task cancellation command supports the same task selection parameters as the list tasks command, so multiple tasks
can be cancelled at the same time. For example, the following command will cancel all reindex tasks running on the
nodes nodeId1 and nodeId2.
POST /_tasks/_cancel?node_id=nodeId1,nodeId2&actions=*reindex
106. Nodes hot_threads
An API that returns the current hot threads on each node in the
cluster. Endpoints are /_nodes/hot_threads and
/_nodes/{nodesIds}/hot_threads.
The output is plain text with a breakdown of each node’s top hot threads. Parameters allowed are:
threads
|
number of hot threads to provide, defaults to 3. |
interval
|
the interval to do the second sampling of threads. Defaults to 500ms. |
type
|
The type to sample, defaults to cpu, but supports wait and block to see hot threads that are in wait or block state. |
ignore_idle_threads
|
If true, known idle threads (e.g. waiting in a socket select, or to get a task from an empty queue) are filtered out. Defaults to true. |
Query DSL
Elasticsearch provides a full Query DSL based on JSON to define queries. Think of the Query DSL as an AST of queries, consisting of two types of clauses:
- Leaf query clauses
-
Leaf query clauses look for a particular value in a particular field, such as the match, term or range queries. These queries can be used by themselves.
- Compound query clauses
-
Compound query clauses wrap other leaf or compound queries and are used to combine multiple queries in a logical fashion (such as the bool or dis_max query), or to alter their behaviour (such as the not or constant_score query).
Query clauses behave differently depending on whether they are used in query context or filter context.
107. Query and filter context
The behaviour of a query clause depends on whether it is used in query context or in filter context:
- Query context
-
A query clause used in query context answers the question “How well does this document match this query clause?” Besides deciding whether or not the document matches, the query clause also calculates a _score representing how well the document matches, relative to other documents.
Query context is in effect whenever a query clause is passed to a query parameter, such as the query parameter in the search API.
- Filter context
-
In filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No — no scores are calculated. Filter context is mostly used for filtering structured data, e.g.
-
Does this timestamp fall into the range 2015 to 2016?
-
Is the status field set to "published"?
Frequently used filters will be cached automatically by Elasticsearch, to speed up performance.
Filter context is in effect whenever a query clause is passed to a filter parameter, such as the filter or must_not parameters in the bool query, the filter parameter in the constant_score query, or the filter aggregation.
Below is an example of query clauses being used in query and filter context
in the search API. This query will match documents where all of the following
conditions are met:
-
The title field contains the word search.
-
The content field contains the word elasticsearch.
-
The status field contains the exact word published.
-
The publish_date field contains a date from 1 Jan 2015 onwards.
GET _search
{
"query": {
"bool": {
"must": [
{ "match": { "title": "Search" }},
{ "match": { "content": "Elasticsearch" }}
],
"filter": [
{ "term": { "status": "published" }},
{ "range": { "publish_date": { "gte": "2015-01-01" }}}
]
}
}
}
The query parameter indicates query context. The bool and two match clauses are used in query context, which means that they are used to score how well each document matches. The filter parameter indicates filter context. The term and range clauses are used in filter context: they will filter out documents which do not match, but they will not affect the score for matching documents.
Tip: Use query clauses in query context for conditions which should affect the score of matching documents (i.e. how well does the document match), and use all other query clauses in filter context.
108. Match All Query
The simplest query, which matches all documents, giving them all a _score
of 1.0.
{ "match_all": {} }
The _score can be changed with the boost parameter:
{ "match_all": { "boost" : 1.2 }}
109. Full text queries
The high-level full text queries are usually used for running full text
queries on full text fields like the body of an email. They understand how the
field being queried is analyzed and will apply each field’s
analyzer (or search_analyzer) to the query string before executing.
The queries in this group are:
match query-
The standard query for performing full text queries, including fuzzy matching and phrase or proximity queries.
multi_match query-
The multi-field version of the match query.
common_terms query-
A more specialized query which gives more preference to uncommon words.
query_string query-
Supports the compact Lucene query string syntax, allowing you to specify AND|OR|NOT conditions and multi-field search within a single query string. For expert users only.
simple_query_string query-
A simpler, more robust version of the query_string syntax suitable for exposing directly to users.
109.1. Match Query
A family of match queries that accepts text/numerics/dates, analyzes
them, and constructs a query. For example:
{
"match" : {
"message" : "this is a test"
}
}
Note, message is the name of a field; you can substitute the name of
any field (including _all) instead.
There are three types of match query: boolean, phrase, and phrase_prefix:
109.1.1. boolean
The default match query is of type boolean. It means that the text
provided is analyzed and the analysis process constructs a boolean query
from the provided text. The operator flag can be set to or or and
to control the boolean clauses (defaults to or). The minimum number of
optional should clauses to match can be set using the
minimum_should_match
parameter.
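How minimum_should_match resolves to a clause count can be sketched as follows. This models only the simple positive-integer and percentage forms; the real parameter also supports negative and combination values not handled here:

```python
def min_should_match(spec, optional_clauses):
    """Simplified reading of minimum_should_match: a positive integer,
    or a percentage like "75%" whose result is rounded down."""
    if isinstance(spec, int):
        return min(spec, optional_clauses)
    if spec.endswith("%"):
        # Percentage of the optional clauses, rounded down.
        return int(optional_clauses * int(spec[:-1]) / 100)
    return int(spec)

print(min_should_match(2, 4))      # 2
print(min_should_match("75%", 4))  # 3
print(min_should_match("50%", 3))  # 1
```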
The analyzer can be set to control which analyzer will perform the
analysis process on the text. It defaults to the field's explicit mapping
definition, or the default search analyzer.
The lenient parameter can be set to true to ignore exceptions caused by
data-type mismatches, such as trying to query a numeric field with a text
query string. Defaults to false.
Fuzziness
fuzziness allows fuzzy matching based on the type of field being queried.
See Fuzziness for allowed settings.
The prefix_length and
max_expansions can be set in this case to control the fuzzy process.
If the fuzzy option is set, the query will use top_terms_blended_freqs_${max_expansions}
as its rewrite method; the fuzzy_rewrite parameter allows control over how the query will be
rewritten.
Here is an example when providing additional parameters (note the slight
change in structure, message is the field name):
{
"match" : {
"message" : {
"query" : "this is a test",
"operator" : "and"
}
}
}
Zero terms query
If the analyzer used removes all tokens in a query like a stop filter
does, the default behavior is to match no documents at all. In order to
change that the zero_terms_query option can be used, which accepts
none (default) and all which corresponds to a match_all query.
{
"match" : {
"message" : {
"query" : "to be or not to be",
"operator" : "and",
"zero_terms_query": "all"
}
}
}
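The effect of zero_terms_query can be sketched with a toy analyzer. The stop list and the match_none/match_all dictionaries below are illustrative representations, not actual Elasticsearch output:

```python
STOPWORDS = {"to", "be", "or", "not"}  # illustrative stop list

def match_query_behaviour(text, zero_terms_query="none"):
    """If analysis strips every token, the effective query is either
    one that matches no documents ("none", the default) or a
    match_all query ("all")."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    if tokens:
        return {"terms": tokens}
    if zero_terms_query == "none":
        return {"match_none": {}}
    return {"match_all": {}}

print(match_query_behaviour("to be or not to be"))
print(match_query_behaviour("to be or not to be", zero_terms_query="all"))
print(match_query_behaviour("to be a test"))
```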
Cutoff frequency
The match query supports a cutoff_frequency that allows
specifying an absolute or relative document frequency where high
frequency terms are moved into an optional subquery and are only scored
if one of the low frequency (below the cutoff) terms in the case of an
or operator or all of the low frequency terms in the case of an and
operator match.
This query allows handling stopwords dynamically at runtime, is domain
independent and doesn’t require a stopword file. It prevents scoring /
iterating high frequency terms and only takes the terms into account if a
more significant / lower frequency term matches a document. Yet, if all
of the query terms are above the given cutoff_frequency the query is
automatically transformed into a pure conjunction (and) query to
ensure fast execution.
The cutoff_frequency can either be relative to the total number of
documents if in the range [0..1) or absolute if greater or equal to
1.0.
Here is an example showing a query composed of stopwords exclusively:
{
"match" : {
"message" : {
"query" : "to be or not to be",
"cutoff_frequency" : 0.001
}
}
}
Note: The cutoff_frequency option operates on a per-shard level. This means
that when trying it out on test indexes with low document numbers you
should follow the advice in Relevance is broken.
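The relative-versus-absolute interpretation of cutoff_frequency can be sketched as follows; the document frequencies and the strictness of the comparison are assumptions of this sketch:

```python
def split_by_cutoff(term_doc_freqs, total_docs, cutoff):
    """Split query terms into low-frequency (always scored) and
    high-frequency (optional) groups. A cutoff in [0, 1) is relative
    to the total document count; a cutoff >= 1.0 is an absolute
    document frequency."""
    threshold = cutoff * total_docs if cutoff < 1.0 else cutoff
    low = [t for t, df in term_doc_freqs.items() if df < threshold]
    high = [t for t, df in term_doc_freqs.items() if df >= threshold]
    return low, high

dfs = {"to": 900, "be": 850, "or": 800, "quick": 3}
low, high = split_by_cutoff(dfs, total_docs=1000, cutoff=0.01)
print(low, high)  # ['quick'] ['to', 'be', 'or']
```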
109.1.2. phrase
The match_phrase query analyzes the text and creates a phrase query
out of the analyzed text. For example:
{
"match_phrase" : {
"message" : "this is a test"
}
}
Since match_phrase is only a type of a match query, it can also be
used in the following manner:
{
"match" : {
"message" : {
"query" : "this is a test",
"type" : "phrase"
}
}
}
A phrase query matches terms up to a configurable slop
(which defaults to 0) in any order. Transposed terms have a slop of 2.
The analyzer can be set to control which analyzer will perform the
analysis process on the text. It defaults to the field's explicit mapping
definition, or the default search analyzer, for example:
{
"match_phrase" : {
"message" : {
"query" : "this is a test",
"analyzer" : "my_analyzer"
}
}
}
109.1.3. match_phrase_prefix
The match_phrase_prefix is the same as match_phrase, except that it
allows for prefix matches on the last term in the text. For example:
{
"match_phrase_prefix" : {
"message" : "this is a test"
}
}
Or:
{
"match" : {
"message" : {
"query" : "this is a test",
"type" : "phrase_prefix"
}
}
}
It accepts the same parameters as the phrase type. In addition, it also
accepts a max_expansions parameter that controls how many
prefixes the last term will be expanded to. It is highly recommended to set
it to an acceptable value to control the execution time of the query.
For example:
{
"match_phrase_prefix" : {
"message" : {
"query" : "this is a test",
"max_expansions" : 10
}
}
}
109.2. Multi Match Query
The multi_match query builds on the match query
to allow multi-field queries:
{
"multi_match" : {
"query": "this is a test",
"fields": [ "subject", "message" ]
}
}
The query parameter holds the query string; fields lists the fields to be queried.
fields and per-field boosting
Fields can be specified with wildcards, eg:
{
"multi_match" : {
"query": "Will Smith",
"fields": [ "title", "*_name" ]
}
}
Query the title, first_name and last_name fields.
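Wildcard patterns are expanded against the fields present in the mapping. A rough sketch of that expansion using shell-style matching (the field names are illustrative):

```python
from fnmatch import fnmatch

def expand_fields(patterns, mapped_fields):
    """Expand multi_match field patterns like "*_name" against the
    fields known from the mapping."""
    return [f for f in mapped_fields
            if any(fnmatch(f, p) for p in patterns)]

mapping = ["title", "first_name", "last_name", "body"]
print(expand_fields(["title", "*_name"], mapping))
# ['title', 'first_name', 'last_name']
```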
Individual fields can be boosted with the caret (^) notation:
{
"multi_match" : {
"query" : "this is a test",
"fields" : [ "subject^3", "message" ]
}
}
The subject field is three times as important as the message field.
Types of multi_match query:
The way the multi_match query is executed internally depends on the type
parameter, which can be set to:
best_fields
|
(default) Finds documents which match any field, but
uses the _score from the best field. See best_fields. |
most_fields
|
Finds documents which match any field and combines
the _score from each field. See most_fields. |
cross_fields
|
Treats fields with the same analyzer as though they were one big field. Looks for each word in any field. See cross_fields. |
phrase
|
Runs a match_phrase query on each field and combines the _score from each field. See phrase and phrase_prefix. |
phrase_prefix
|
Runs a match_phrase_prefix query on each field and combines the _score from each field. See phrase and phrase_prefix. |
109.2.1. best_fields
The best_fields type is most useful when you are searching for multiple
words best found in the same field. For instance “brown fox” in a single
field is more meaningful than “brown” in one field and “fox” in the other.
The best_fields type generates a match query for
each field and wraps them in a dis_max query, to
find the single best matching field. For instance, this query:
{
"multi_match" : {
"query": "brown fox",
"type": "best_fields",
"fields": [ "subject", "message" ],
"tie_breaker": 0.3
}
}
would be executed as:
{
"dis_max": {
"queries": [
{ "match": { "subject": "brown fox" }},
{ "match": { "message": "brown fox" }}
],
"tie_breaker": 0.3
}
}
Normally the best_fields type uses the score of the single best matching
field, but if tie_breaker is specified, then it calculates the score as
follows:
-
the score from the best matching field
-
plus
tie_breaker * _score for all other matching fields
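That scoring rule can be written out directly. This is a sketch of the formula, not Lucene's actual implementation:

```python
def dis_max_score(field_scores, tie_breaker=0.0):
    """Score of a dis_max query: the best matching field's score,
    plus tie_breaker times the score of every other matching field."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

# subject matched with 1.2, message with 0.8:
print(dis_max_score([1.2, 0.8], tie_breaker=0.3))  # 1.2 + 0.3 * 0.8 = 1.44
```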
Also, accepts analyzer, boost, operator, minimum_should_match,
fuzziness, prefix_length, max_expansions, rewrite, zero_terms_query
and cutoff_frequency, as explained in match query.
Important: operator and minimum_should_match
The best_fields and most_fields types are field-centric: they generate a match query per field. This means that the operator and minimum_should_match parameters are applied to each field individually, which is probably not what you want. For example, a query for "Will Smith" on the first_name and last_name fields with "operator": "and" is executed as:
(+first_name:will +first_name:smith) | (+last_name:will +last_name:smith)
In other words, all terms must be present in a single field for a document to match. See cross_fields for a better solution.
109.2.2. most_fields
The most_fields type is most useful when querying multiple fields that
contain the same text analyzed in different ways. For instance, the main
field may contain synonyms, stemming and terms without diacritics. A second
field may contain the original terms, and a third field might contain
shingles. By combining scores from all three fields we can match as many
documents as possible with the main field, but use the second and third fields
to push the most similar results to the top of the list.
This query:
{
"multi_match" : {
"query": "quick brown fox",
"type": "most_fields",
"fields": [ "title", "title.original", "title.shingles" ]
}
}
would be executed as:
{
"bool": {
"should": [
{ "match": { "title": "quick brown fox" }},
{ "match": { "title.original": "quick brown fox" }},
{ "match": { "title.shingles": "quick brown fox" }}
]
}
}
The score from each match clause is added together, then divided by the
number of match clauses.
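As a sketch of the combination just described (a simplification; real Lucene scoring involves more factors):

```python
def most_fields_score(clause_scores):
    """bool/should scoring as described above: clause scores are
    summed, then divided by the number of clauses."""
    return sum(clause_scores) / len(clause_scores)

# title, title.original and title.shingles matched with these scores:
print(most_fields_score([0.9, 0.3, 0.6]))  # ~0.6
```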
Also, accepts analyzer, boost, operator, minimum_should_match,
fuzziness, prefix_length, max_expansions, rewrite, zero_terms_query
and cutoff_frequency, as explained in match query, but
see operator and minimum_should_match.
109.2.3. phrase and phrase_prefix
The phrase and phrase_prefix types behave just like best_fields,
but they use a match_phrase or match_phrase_prefix query instead of a
match query.
This query:
{
"multi_match" : {
"query": "quick brown f",
"type": "phrase_prefix",
"fields": [ "subject", "message" ]
}
}
would be executed as:
{
"dis_max": {
"queries": [
{ "match_phrase_prefix": { "subject": "quick brown f" }},
{ "match_phrase_prefix": { "message": "quick brown f" }}
]
}
}
Also, accepts analyzer, boost, slop and zero_terms_query as explained
in Match Query. Type phrase_prefix additionally accepts
max_expansions.
109.2.4. cross_fields
The cross_fields type is particularly useful with structured documents where
multiple fields should match. For instance, when querying the first_name
and last_name fields for “Will Smith”, the best match is likely to have
“Will” in one field and “Smith” in the other.
One way of dealing with these types of queries is simply to index the
first_name and last_name fields into a single full_name field. Of
course, this can only be done at index time.
The cross_fields type tries to solve these problems at query time by taking a
term-centric approach. It first analyzes the query string into individual
terms, then looks for each term in any of the fields, as though they were one
big field.
A query like:
{
"multi_match" : {
"query": "Will Smith",
"type": "cross_fields",
"fields": [ "first_name", "last_name" ],
"operator": "and"
}
}
is executed as:
+(first_name:will last_name:will) +(first_name:smith last_name:smith)
In other words, all terms must be present in at least one field for a
document to match. (Compare this to
the logic used for best_fields and most_fields.)
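The term-centric grouping can be sketched as a small string builder producing the same Lucene-style form shown above:

```python
def cross_fields_clause(terms, fields):
    """Build the term-centric Lucene-style query string: each term
    must appear in at least one of the fields."""
    return " ".join(
        "+(" + " ".join(f"{field}:{term}" for field in fields) + ")"
        for term in terms
    )

print(cross_fields_clause(["will", "smith"], ["first_name", "last_name"]))
# +(first_name:will last_name:will) +(first_name:smith last_name:smith)
```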
That solves one of the two problems. The problem of differing term frequencies is solved by blending the term frequencies for all fields in order to even out the differences.
In practice, first_name:smith will be treated as though it has the same
frequencies as last_name:smith, plus one. This will make matches on
first_name and last_name have comparable scores, with a tiny advantage
for last_name since it is the most likely field that contains smith.
Note that cross_fields is usually only useful on short string fields
that all have a boost of 1. Otherwise boosts, term freqs and length
normalization contribute to the score in such a way that the blending of term
statistics is not meaningful anymore.
If you run the above query through the Validate API, it returns this explanation:
+blended("will", fields: [first_name, last_name])
+blended("smith", fields: [first_name, last_name])
Also, accepts analyzer, boost, operator, minimum_should_match,
zero_terms_query and cutoff_frequency, as explained in
match query.
cross_fields and analysis
The cross_fields type can only work in term-centric mode on fields that have
the same analyzer. Fields with the same analyzer are grouped together as in
the example above. If there are multiple groups, they are combined with a
bool query.
For instance, if we have a first and last field which have
the same analyzer, plus a first.edge and last.edge which
both use an edge_ngram analyzer, this query:
{
"multi_match" : {
"query": "Jon",
"type": "cross_fields",
"fields": [
"first", "first.edge",
"last", "last.edge"
]
}
}
would be executed as:
blended("jon", fields: [first, last])
| (
blended("j", fields: [first.edge, last.edge])
blended("jo", fields: [first.edge, last.edge])
blended("jon", fields: [first.edge, last.edge])
)
In other words, first and last would be grouped together and
treated as a single field, and first.edge and last.edge would be
grouped together and treated as a single field.
Having multiple groups is fine, but when combined with operator or
minimum_should_match, it can suffer from the same problem
as most_fields or best_fields.
You can easily rewrite this query yourself as two separate cross_fields
queries combined with a bool query, and apply the minimum_should_match
parameter to just one of them:
{
"bool": {
"should": [
{
"multi_match" : {
"query": "Will Smith",
"type": "cross_fields",
"fields": [ "first", "last" ],
"minimum_should_match": "50%"
}
},
{
"multi_match" : {
"query": "Will Smith",
"type": "cross_fields",
"fields": [ "*.edge" ]
}
}
]
}
}
Either will or smith must be present in either of the first
or last fields.
You can force all fields into the same group by specifying the analyzer
parameter in the query.
{
"multi_match" : {
"query": "Jon",
"type": "cross_fields",
"analyzer": "standard",
"fields": [ "first", "last", "*.edge" ]
}
}
Use the standard analyzer for all fields.
which will be executed as:
blended("jon", fields: [first, first.edge, last.edge, last])
tie_breaker
By default, each per-term blended query will use the best score returned by
any field in a group, then these scores are added together to give the final
score. The tie_breaker parameter can change the default behaviour of the
per-term blended queries. It accepts:
| Value | Behaviour |
|---|---|
| 0.0 | Take the single best score out of (eg) first_name:will and last_name:will (default) |
| 1.0 | Add together the scores for (eg) first_name:will and last_name:will |
| 0.0 < n < 1.0 | Take the single best score plus tie_breaker multiplied by each of the scores from other matching fields |
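The combination rule can be sketched in Python (a simplified illustration of the scoring arithmetic only, with made-up field scores; this is not Lucene's actual implementation):

```python
def blended_score(field_scores, tie_breaker=0.0):
    """Combine one term's scores across a group of fields.

    Takes the single best field score, plus tie_breaker multiplied
    by each of the remaining field scores.
    """
    best = max(field_scores)
    rest = sum(field_scores) - best
    return best + tie_breaker * rest

scores = [1.2, 0.8, 0.5]                    # same term matched in three fields
print(blended_score(scores, 0.0))           # 1.2 (best score only)
print(blended_score(scores, 1.0))           # 2.5 (sum of all scores)
print(round(blended_score(scores, 0.3), 2)) # 1.59 (best plus 0.3 * others)
```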
109.3. Common Terms Query
The common terms query is a modern alternative to stopwords which
improves the precision and recall of search results (by taking stopwords
into account), without sacrificing performance.
The problem
Every term in a query has a cost. A search for "The brown fox"
requires three term queries, one for each of "the", "brown" and
"fox", all of which are executed against all documents in the index.
The query for "the" is likely to match many documents and thus has a
much smaller impact on relevance than the other two terms.
Previously, the solution to this problem was to ignore terms with high
frequency. By treating "the" as a stopword, we reduce the index size
and reduce the number of term queries that need to be executed.
The problem with this approach is that, while stopwords have a small
impact on relevance, they are still important. If we remove stopwords,
we lose precision (eg we are unable to distinguish between "happy"
and "not happy") and we lose recall (eg text like "The The" or
"To be or not to be" would simply not exist in the index).
The solution
The common terms query divides the query terms into two groups: more
important (ie low frequency terms) and less important (ie high
frequency terms which would previously have been stopwords).
First it searches for documents which match the more important terms. These are the terms which appear in fewer documents and have a greater impact on relevance.
Then, it executes a second query for the less important terms — terms
which appear frequently and have a low impact on relevance. But instead
of calculating the relevance score for all matching documents, it only
calculates the _score for documents already matched by the first
query. In this way the high frequency terms can improve the relevance
calculation without paying the cost of poor performance.
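The two-pass approach can be sketched as a toy scoring loop in Python (a conceptual illustration with made-up term weights, not the real Lucene scoring):

```python
# docs maps a doc id to the set of terms it contains
docs = {
    1: {"the", "brown", "fox"},
    2: {"the", "quick", "rabbit"},
    3: {"brown", "bear"},
}

def common_terms_score(docs, low_freq, high_freq):
    """First pass: match and score only the low-frequency terms.
    Second pass: add high-frequency term scores, but only for
    documents that already matched in the first pass."""
    scores = {}
    for doc_id, terms in docs.items():
        hits = low_freq & terms
        if hits:
            scores[doc_id] = len(hits)              # stand-in for a real score
    for doc_id in scores:
        scores[doc_id] += len(high_freq & docs[doc_id]) * 0.1
    return scores

# doc 2 never matches a low-frequency term, so "the" is never scored for it
print(common_terms_score(docs, {"brown", "fox"}, {"the"}))  # {1: 2.1, 3: 1.0}
```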
If a query consists only of high frequency terms, then a single query is
executed as an AND (conjunction) query, in other words all terms are
required. Even though each individual term will match many documents,
the combination of terms narrows down the resultset to only the most
relevant. The single query can also be executed as an OR with a
specific
minimum_should_match,
in this case a high enough value should probably be used.
Terms are allocated to the high or low frequency groups based on the
cutoff_frequency, which can be specified as an absolute frequency
(>=1) or as a relative frequency (0.0 .. 1.0). (Remember that document
frequencies are computed on a per shard level as explained in the blog post
Relevance is broken.)
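The allocation rule can be sketched as follows (the document frequencies are hypothetical figures for illustration):

```python
def split_terms(doc_freqs, total_docs, cutoff_frequency):
    """Split query terms into low- and high-frequency groups.

    A cutoff >= 1 is an absolute document frequency; a value in
    (0.0, 1.0) is a fraction of the total document count.
    """
    threshold = (cutoff_frequency if cutoff_frequency >= 1
                 else cutoff_frequency * total_docs)
    low = [t for t, df in doc_freqs.items() if df < threshold]
    high = [t for t, df in doc_freqs.items() if df >= threshold]
    return low, high

# hypothetical per-shard document frequencies
doc_freqs = {"the": 900_000, "brown": 120, "fox": 80}
low, high = split_terms(doc_freqs, total_docs=1_000_000, cutoff_frequency=0.001)
print(low, high)   # ['brown', 'fox'] ['the']
```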
Perhaps the most interesting property of this query is that it adapts to
domain specific stopwords automatically. For example, on a video hosting
site, common terms like "clip" or "video" will automatically behave
as stopwords without the need to maintain a manual list.
Examples
In this example, words that have a document frequency greater than 0.1%
(eg "this" and "is") will be treated as common terms.
{
"common": {
"body": {
"query": "this is bonsai cool",
"cutoff_frequency": 0.001
}
}
}
The number of terms which should match can be controlled with the
minimum_should_match
(high_freq, low_freq), low_freq_operator (default "or") and
high_freq_operator (default "or") parameters.
For low frequency terms, set the low_freq_operator to "and" to make
all terms required:
{
"common": {
"body": {
"query": "nelly the elephant as a cartoon",
"cutoff_frequency": 0.001,
"low_freq_operator": "and"
}
}
}
which is roughly equivalent to:
{
"bool": {
"must": [
{ "term": { "body": "nelly"}},
{ "term": { "body": "elephant"}},
{ "term": { "body": "cartoon"}}
],
"should": [
{ "term": { "body": "the"}},
{ "term": { "body": "as"}},
{ "term": { "body": "a"}}
]
}
}
Alternatively use
minimum_should_match
to specify a minimum number or percentage of low frequency terms which
must be present, for instance:
{
"common": {
"body": {
"query": "nelly the elephant as a cartoon",
"cutoff_frequency": 0.001,
"minimum_should_match": 2
}
}
}
which is roughly equivalent to:
{
"bool": {
"must": {
"bool": {
"should": [
{ "term": { "body": "nelly"}},
{ "term": { "body": "elephant"}},
{ "term": { "body": "cartoon"}}
],
"minimum_should_match": 2
}
},
"should": [
{ "term": { "body": "the"}},
{ "term": { "body": "as"}},
{ "term": { "body": "a"}}
]
}
}
minimum_should_match
A different
minimum_should_match
can be applied for low and high frequency terms with the additional
low_freq and high_freq parameters. Here is an example when providing
additional parameters (note the change in structure):
{
"common": {
"body": {
"query": "nelly the elephant not as a cartoon",
"cutoff_frequency": 0.001,
"minimum_should_match": {
"low_freq" : 2,
"high_freq" : 3
}
}
}
}
which is roughly equivalent to:
{
"bool": {
"must": {
"bool": {
"should": [
{ "term": { "body": "nelly"}},
{ "term": { "body": "elephant"}},
{ "term": { "body": "cartoon"}}
],
"minimum_should_match": 2
}
},
"should": {
"bool": {
"should": [
{ "term": { "body": "the"}},
{ "term": { "body": "not"}},
{ "term": { "body": "as"}},
{ "term": { "body": "a"}}
],
"minimum_should_match": 3
}
}
}
}
In this case it means the high frequency terms have only an impact on
relevance when there are at least three of them. But the most
interesting use of the
minimum_should_match
for high frequency terms is when there are only high frequency terms:
{
"common": {
"body": {
"query": "how not to be",
"cutoff_frequency": 0.001,
"minimum_should_match": {
"low_freq" : 2,
"high_freq" : 3
}
}
}
}
which is roughly equivalent to:
{
"bool": {
"should": [
{ "term": { "body": "how"}},
{ "term": { "body": "not"}},
{ "term": { "body": "to"}},
{ "term": { "body": "be"}}
],
"minimum_should_match": "3<50%"
}
}
The query generated for the high frequency terms is then slightly less
restrictive than with an AND.
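The "3&lt;50%" specification reads as: with three or fewer optional clauses, all of them are required; with more than three, only 50% (rounded down) are required. A hedged sketch of that rule:

```python
import math

def required_clauses(total, spec="3<50%"):
    """Interpret a minimum_should_match spec of the form "N<P%":
    with N or fewer optional clauses all are required; with more
    than N, only P percent (rounded down) are required."""
    n_str, pct_str = spec.split("<")
    n, pct = int(n_str), int(pct_str.rstrip("%"))
    if total <= n:
        return total
    return math.floor(total * pct / 100)

print(required_clauses(3))  # 3 (all required)
print(required_clauses(4))  # 2 (50% of 4)
print(required_clauses(7))  # 3 (50% of 7, rounded down)
```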
The common terms query also supports boost, analyzer and
disable_coord as parameters.
109.4. Query String Query
A query that uses a query parser in order to parse its content. Here is an example:
{
"query_string" : {
"default_field" : "content",
"query" : "this AND that OR thus"
}
}
The query_string top level parameters include:
| Parameter | Description |
|---|---|
| query | The actual query to be parsed. See Query string syntax. |
| default_field | The default field for query terms if no prefix field is specified. Defaults to the index.query.default_field index setting, which in turn defaults to _all. |
| default_operator | The default operator used if no explicit operator is specified. For example, with a default operator of OR, the query capital of Hungary is translated to capital OR of OR Hungary, and with a default operator of AND, the same query is translated to capital AND of AND Hungary. The default value is OR. |
| analyzer | The analyzer name used to analyze the query string. |
| allow_leading_wildcard | When set, * or ? are allowed as the first character. Defaults to true. |
| lowercase_expanded_terms | Whether terms of wildcard, prefix, fuzzy, and range queries are to be automatically lower-cased or not (since they are not analyzed). Defaults to true. |
| enable_position_increments | Set to true to enable position increments in result queries. Defaults to true. |
| fuzzy_max_expansions | Controls the number of terms fuzzy queries will expand to. Defaults to 50. |
| fuzziness | Set the fuzziness for fuzzy queries. Defaults to AUTO. |
| fuzzy_prefix_length | Set the prefix length for fuzzy queries. Default is 0. |
| phrase_slop | Sets the default slop for phrases. If zero, then exact phrase matches are required. Default value is 0. |
| boost | Sets the boost value of the query. Defaults to 1.0. |
| analyze_wildcard | By default, wildcard terms in a query string are not analyzed. By setting this value to true, a best effort will be made to analyze those as well. |
| auto_generate_phrase_queries | Defaults to false. |
| max_determinized_states | Limit on how many automaton states regexp queries are allowed to create. This protects against too-difficult (e.g. exponentially hard) regexps. Defaults to 10000. |
| minimum_should_match | A value controlling how many "should" clauses in the resulting boolean query should match. It can be an absolute value, a percentage, or a combination of both. |
| lenient | If set to true will cause format based failures (like providing text to a numeric field) to be ignored. |
| locale | Locale that should be used for string conversions. Defaults to ROOT. |
| time_zone | Time Zone to be applied to any range query related to dates. See also JODA timezone. |
When a multi term query is being generated, one can control how it gets rewritten using the rewrite parameter.
Default Field
When a field is not explicitly specified in the query string syntax,
the index.query.default_field setting is used to derive which field to
search on. It defaults to the _all field. So, if the _all field is
disabled, it might make sense to set a different default field.
Multi Field
The query_string query can also run against multiple fields. Fields can be
provided via the "fields" parameter (example below).
The idea of running the query_string query against multiple fields is to
expand each query term to an OR clause like this:
field1:query_term OR field2:query_term | ...
For example, the following query
{
"query_string" : {
"fields" : ["content", "name"],
"query" : "this AND that"
}
}
matches the same words as
{
"query_string": {
"query": "(content:this OR name:this) AND (content:that OR name:that)"
}
}
Since several queries are generated from the individual search terms,
combining them can be automatically done using either a dis_max query or a
simple bool query. For example (the name is boosted by 5 using ^5
notation):
{
"query_string" : {
"fields" : ["content", "name^5"],
"query" : "this AND that OR thus",
"use_dis_max" : true
}
}
Simple wildcards can also be used to search "within" specific inner
elements of the document. For example, if we have a city object with
several fields (or inner object with fields) in it, we can automatically
search on all "city" fields:
{
"query_string" : {
"fields" : ["city.*"],
"query" : "this AND that OR thus",
"use_dis_max" : true
}
}
Another option is to provide the wildcard fields search in the query
string itself (properly escaping the * sign), for example:
city.\*:something.
When running the query_string query against multiple fields, the
following additional parameters are allowed:
| Parameter | Description |
|---|---|
| use_dis_max | Should the queries be combined using dis_max (set to true), or a bool query (set to false). Defaults to true. |
| tie_breaker | When using dis_max, the disjunction max tie breaker. Defaults to 0. |
The fields parameter can also include pattern based field names, allowing them to be automatically expanded to the relevant fields (dynamically introduced fields included). For example:
{
"query_string" : {
"fields" : ["content", "name.*^5"],
"query" : "this AND that OR thus",
"use_dis_max" : true
}
}
109.4.1. Query string syntax
The query string “mini-language” is used by the
Query String Query and by the
q query string parameter in the search API.
The query string is parsed into a series of terms and operators. A
term can be a single word — quick or brown — or a phrase, surrounded by
double quotes — "quick brown" — which searches for all the words in the
phrase, in the same order.
Operators allow you to customize the search — the available options are explained below.
Field names
As mentioned in Query String Query, the default_field is searched for the
search terms, but it is possible to specify other fields in the query syntax:
- Where the status field contains active:
  status:active
- Where the title field contains quick or brown (if you omit the OR operator, the default operator will be used):
  title:(quick OR brown)
  title:(quick brown)
- Where the author field contains the exact phrase "john smith":
  author:"John Smith"
- Where any of the fields book.title, book.content or book.date contains quick or brown (note how we need to escape the * with a backslash):
  book.\*:(quick brown)
- Where the field title has no value (or is missing):
  _missing_:title
- Where the field title has any non-null value:
  _exists_:title
Wildcards
Wildcard searches can be run on individual terms, using ? to replace
a single character, and * to replace zero or more characters:
qu?ck bro*
Be aware that wildcard queries can use an enormous amount of memory and
perform very badly — just think how many terms need to be queried to
match the query string "a* b* c*".
Allowing a wildcard at the beginning of a word (eg "*ing") is
particularly heavy, because all terms in the index need to be examined,
just in case they match. Leading wildcards can be disabled by setting
allow_leading_wildcard to false.
Wildcarded terms are not analyzed by default — they are lowercased
(lowercase_expanded_terms defaults to true) but no further analysis
is done, mainly because it is impossible to accurately analyze a word that
is missing some of its letters. However, by setting analyze_wildcard to
true, an attempt will be made to analyze wildcarded words before searching
the term list for matching terms.
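The ? and * semantics are the familiar shell-glob rules, which can be demonstrated with Python's standard fnmatch module (for illustration only; Elasticsearch matches wildcard patterns against indexed terms, not via fnmatch):

```python
from fnmatch import fnmatchcase

# ? matches exactly one character, * matches zero or more
print(fnmatchcase("quick", "qu?ck"))  # True
print(fnmatchcase("brown", "bro*"))   # True
print(fnmatchcase("quck", "qu?ck"))   # False: ? must consume one character
```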
Regular expressions
Regular expression patterns can be embedded in the query string by
wrapping them in forward-slashes ("/"):
name:/joh?n(ath[oa]n)/
The supported regular expression syntax is explained in Regular expression syntax.
A query string such as the following would force Elasticsearch to visit
every term in the index:

/.*n/

Use with caution!
Fuzziness
We can search for terms that are similar to, but not exactly like our search terms, using the “fuzzy” operator:
quikc~ brwn~ foks~
This uses the Damerau-Levenshtein distance to find all terms with a maximum of two changes, where a change is the insertion, deletion or substitution of a single character, or transposition of two adjacent characters.
The default edit distance is 2, but an edit distance of 1 should be
sufficient to catch 80% of all human misspellings. It can be specified as:
quikc~1
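The metric in question can be sketched as the optimal-string-alignment variant of Damerau-Levenshtein distance (a simplified illustration; Lucene actually uses automata rather than this dynamic-programming table):

```python
def osa_distance(a, b):
    """Edit distance counting insertions, deletions, substitutions,
    and transpositions of two adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(osa_distance("quikc", "quick"))  # 1 (one transposition)
print(osa_distance("brwn", "brown"))   # 1 (one insertion)
```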
Proximity searches
While a phrase query (eg "john smith") expects all of the terms in exactly
the same order, a proximity query allows the specified words to be further
apart or in a different order. In the same way that fuzzy queries can
specify a maximum edit distance for characters in a word, a proximity search
allows us to specify a maximum edit distance of words in a phrase:
"fox quick"~5
The closer the text in a field is to the original order specified in the
query string, the more relevant that document is considered to be. When
compared to the above example query, the phrase "quick fox" would be
considered more relevant than "quick brown fox".
Ranges
Ranges can be specified for date, numeric or string fields. Inclusive ranges
are specified with square brackets [min TO max] and exclusive ranges with
curly brackets {min TO max}.
- All days in 2012:
  date:[2012-01-01 TO 2012-12-31]
- Numbers 1..5:
  count:[1 TO 5]
- Tags between alpha and omega, excluding alpha and omega:
  tag:{alpha TO omega}
- Numbers from 10 upwards:
  count:[10 TO *]
- Dates before 2012:
  date:{* TO 2012-01-01}
Curly and square brackets can be combined:
- Numbers from 1 up to but not including 5:
  count:[1 TO 5}
Ranges with one side unbounded can use the following syntax:
age:>10 age:>=10 age:<10 age:<=10
To combine an upper and lower bound with the simplified syntax, you
would need to join two clauses with an AND operator:

age:(>=10 AND <20)
age:(+>=10 +<20)
The parsing of ranges in query strings can be complex and error prone. It is
much more reliable to use an explicit range query.
Boosting
Use the boost operator ^ to make one term more relevant than another.
For instance, if we want to find all documents about foxes, but we are
especially interested in quick foxes:
quick^2 fox
The default boost value is 1, but can be any positive floating point number.
Boosts between 0 and 1 reduce relevance.
Boosts can also be applied to phrases or to groups:
"john smith"^2 (foo bar)^4
Boolean operators
By default, all terms are optional, as long as one term matches. A search
for foo bar baz will find any document that contains one or more of
foo or bar or baz. We have already discussed the default_operator
above which allows you to force all terms to be required, but there are
also boolean operators which can be used in the query string itself
to provide more control.
The preferred operators are + (this term must be present) and -
(this term must not be present). All other terms are optional.
For example, this query:
quick brown +fox -news
states that:
- fox must be present
- news must not be present
- quick and brown are optional; their presence increases the relevance
The familiar operators AND, OR and NOT (also written &&, || and !)
are also supported. However, the effects of these operators can be more
complicated than is obvious at first glance. NOT takes precedence over
AND, which takes precedence over OR. While the + and - only affect
the term to the right of the operator, AND and OR can affect the terms to
the left and right.
Grouping
Multiple terms or clauses can be grouped together with parentheses, to form sub-queries:
(quick OR brown) AND fox
Groups can be used to target a particular field, or to boost the result of a sub-query:
status:(active OR pending) title:(full text search)^2
Reserved characters
If you need to use any of the characters which function as operators in your
query itself (and not as operators), then you should escape them with
a leading backslash. For instance, to search for (1+1)=2, you would
need to write your query as \(1\+1\)\=2.
The reserved characters are: + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /
Failing to escape these special characters correctly could lead to a syntax error which prevents your query from running.
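A small escaping helper can be sketched as follows (a hypothetical utility, not part of any Elasticsearch client; note that & and | are only operators when doubled, so lone occurrences pass through):

```python
RESERVED = set('+-=><!(){}[]^"~*?:\\/')

def escape_query_string(text):
    """Backslash-escape query_string reserved characters.
    && and || are operators only when doubled, so a lone & or |
    passes through unescaped."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in RESERVED:
            out.append("\\" + ch)
        elif ch in "&|" and i + 1 < len(text) and text[i + 1] == ch:
            out.append("\\" + ch + "\\" + ch)
            i += 1
        else:
            out.append(ch)
        i += 1
    return "".join(out)

print(escape_query_string("(1+1)=2"))  # \(1\+1\)\=2
```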
109.5. Simple Query String Query
A query that uses the SimpleQueryParser to parse its content. Unlike the
regular query_string query, the simple_query_string query will never
throw an exception, and discards invalid parts of the query. Here is
an example:
{
"simple_query_string" : {
"query": "\"fried eggs\" +(eggplant | potato) -frittata",
"analyzer": "snowball",
"fields": ["body^5","_all"],
"default_operator": "and"
}
}
The simple_query_string top level parameters include:
| Parameter | Description |
|---|---|
| query | The actual query to be parsed. See below for syntax. |
| fields | The fields to perform the parsed query against. Defaults to the index.query.default_field index setting, which in turn defaults to _all. |
| default_operator | The default operator used if no explicit operator is specified. For example, with a default operator of OR, the query capital of Hungary is translated to capital OR of OR Hungary, and with a default operator of AND, the same query is translated to capital AND of AND Hungary. The default value is OR. |
| analyzer | The analyzer used to analyze each term of the query when creating composite queries. |
| flags | Flags specifying which features of the simple_query_string to enable. Defaults to ALL. |
| lowercase_expanded_terms | Whether terms of prefix and fuzzy queries should be automatically lower-cased or not (since they are not analyzed). Defaults to true. |
| analyze_wildcard | Whether terms of prefix queries should be automatically analyzed or not. If true a best effort will be made to analyze the prefix. Defaults to false. |
| locale | Locale that should be used for string conversions. Defaults to ROOT. |
| lenient | If set to true will cause format based failures (like providing text to a numeric field) to be ignored. |
| minimum_should_match | The minimum number of clauses that must match for a document to be returned. See the minimum_should_match documentation for the full list of options. |
Simple Query String Syntax
The simple_query_string supports the following special characters:
- + signifies AND operation
- | signifies OR operation
- - negates a single token
- " wraps a number of tokens to signify a phrase for searching
- * at the end of a term signifies a prefix query
- ( and ) signify precedence
- ~N after a word signifies edit distance (fuzziness)
- ~N after a phrase signifies slop amount
In order to search for any of these special characters, they will need to
be escaped with \.
Default Field
When a field is not explicitly specified in the query string syntax,
the index.query.default_field setting is used to derive which field to
search on. It defaults to the _all field. So, if the _all field is
disabled, it might make sense to set a different default field.
Multi Field
The fields parameter can also include pattern based field names, allowing them to be automatically expanded to the relevant fields (dynamically introduced fields included). For example:
{
"simple_query_string" : {
"fields" : ["content", "name.*^5"],
"query" : "foo bar baz"
}
}
Flags
simple_query_string supports multiple flags to specify which parsing features
should be enabled. It is specified as a |-delimited string with the
flags parameter:
{
"simple_query_string" : {
"query" : "foo | bar + baz*",
"flags" : "OR|AND|PREFIX"
}
}
The available flags are: ALL, NONE, AND, OR, NOT, PREFIX, PHRASE,
PRECEDENCE, ESCAPE, WHITESPACE, FUZZY, NEAR, and SLOP.
110. Term level queries
While the full text queries will analyze the query string before executing, the term-level queries operate on the exact terms that are stored in the inverted index.
These queries are usually used for structured data like numbers, dates, and enums, rather than full text fields. Alternatively, they allow you to craft low-level queries, foregoing the analysis process.
The queries in this group are:
term query
Find documents which contain the exact term specified in the field specified.

terms query
Find documents which contain any of the exact terms specified in the field specified.

range query
Find documents where the field specified contains values (dates, numbers, or strings) in the range specified.

exists query
Find documents where the field specified contains any non-null value.

missing query
Find documents where the field specified is missing or contains only null values.

prefix query
Find documents where the field specified contains terms which begin with the exact prefix specified.

wildcard query
Find documents where the field specified contains terms which match the pattern specified, where the pattern supports single character wildcards (?) and multi-character wildcards (*).

regexp query
Find documents where the field specified contains terms which match the regular expression specified.

fuzzy query
Find documents where the field specified contains terms which are fuzzily similar to the specified term. Fuzziness is measured as a Levenshtein edit distance of 1 or 2.

type query
Find documents of the specified type.

ids query
Find documents with the specified type and IDs.
110.1. Term Query
The term query finds documents that contain the exact term specified
in the inverted index. For instance:
{
"term" : { "user" : "Kimchy" }
}
Finds documents which contain the exact term Kimchy in the inverted index
of the user field.
A boost parameter can be specified to give this term query a higher
relevance score than another query, for instance:
GET /_search
{
"query": {
"bool": {
"should": [
{
"term": {
"status": {
"value": "urgent",
"boost": 2.0
}
}
},
{
"term": {
"status": "normal"
}
}
]
}
}
}
The urgent query clause has a boost of 2.0, meaning it is twice as important
as the query clause for normal, which has the default neutral boost of 1.0.
110.2. Terms Query
Filters documents that have fields that match any of the provided terms (not analyzed). For example:
{
"constant_score" : {
"filter" : {
"terms" : { "user" : ["kimchy", "elasticsearch"]}
}
}
}
The terms query is also aliased with in as the filter name for
simpler usage.
Terms lookup mechanism
When you need to specify a terms filter with a large number of terms, it
can be beneficial to fetch those term values from a document in an index. A
concrete example would be to filter tweets tweeted by your followers.
The number of user ids specified in the terms filter could potentially be
very large. In this scenario it makes sense to use the terms filter's terms
lookup mechanism.
The terms lookup mechanism supports the following options:
| Option | Description |
|---|---|
| index | The index to fetch the term values from. Defaults to the current index. |
| type | The type to fetch the term values from. |
| id | The id of the document to fetch the term values from. |
| path | The field specified as path to fetch the actual values for the terms filter. |
| routing | A custom routing value to be used when retrieving the external terms doc. |
The values for the terms filter will be fetched from a field in a
document with the specified id in the specified type and index.
Internally a get request is executed to fetch the values from the
specified path. At the moment for this feature to work the _source
needs to be stored.
Also, consider using an index with a single shard and fully replicated across all nodes if the "reference" terms data is not large. The lookup terms filter will prefer to execute the get request on a local node if possible, reducing the need for networking.
Terms lookup twitter example
# index the information for user with id 2, specifically, its followers
curl -XPUT localhost:9200/users/user/2 -d '{
"followers" : ["1", "3"]
}'
# index a tweet, from user with id 2
curl -XPUT localhost:9200/tweets/tweet/1 -d '{
"user" : "2"
}'
# search on all the tweets that match the followers of user 2
curl -XGET localhost:9200/tweets/_search -d '{
"query" : {
"terms" : {
"user" : {
"index" : "users",
"type" : "user",
"id" : "2",
"path" : "followers"
}
}
}
}'
The structure of the external terms document can also include an array of inner objects, for example:
curl -XPUT localhost:9200/users/user/2 -d '{
"followers" : [
{
"id" : "1"
},
{
"id" : "2"
}
]
}'
In which case, the lookup path will be followers.id.
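A sketch of how a dotted lookup path might resolve values from the fetched document's _source (a hypothetical helper for illustration, not part of the Elasticsearch API):

```python
def extract_path(source, path):
    """Resolve a dotted path like "followers.id" against a document,
    flattening any arrays of inner objects along the way."""
    values = [source]
    for key in path.split("."):
        next_values = []
        for v in values:
            if isinstance(v, list):
                next_values.extend(item[key] for item in v if key in item)
            elif isinstance(v, dict) and key in v:
                next_values.append(v[key])
        values = next_values
    # a trailing array of scalars flattens too
    flat = []
    for v in values:
        if isinstance(v, list):
            flat.extend(v)
        else:
            flat.append(v)
    return flat

doc = {"followers": [{"id": "1"}, {"id": "2"}]}
print(extract_path(doc, "followers.id"))                 # ['1', '2']
print(extract_path({"followers": ["1", "3"]}, "followers"))  # ['1', '3']
```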
110.3. Range Query
Matches documents with fields that have terms within a certain range.
The type of the Lucene query depends on the field type: for string
fields it is a TermRangeQuery, while for number/date fields it is
a NumericRangeQuery. The following example returns all documents where
age is between 10 and 20:
{
"range" : {
"age" : {
"gte" : 10,
"lte" : 20,
"boost" : 2.0
}
}
}
The range query accepts the following parameters:
| Parameter | Description |
|---|---|
| gte | Greater-than or equal to |
| gt | Greater-than |
| lte | Less-than or equal to |
| lt | Less-than |
| boost | Sets the boost value of the query, defaults to 1.0 |
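The four bound parameters translate to ordinary comparisons; a minimal sketch of the semantics:

```python
import operator

# map each range-query bound name to its comparison
OPS = {"gte": operator.ge, "gt": operator.gt,
       "lte": operator.le, "lt": operator.lt}

def in_range(value, bounds):
    """Check a value against range-query style bounds."""
    return all(OPS[name](value, limit)
               for name, limit in bounds.items() if name in OPS)

print(in_range(10, {"gte": 10, "lte": 20}))  # True  (inclusive bounds)
print(in_range(10, {"gt": 10, "lt": 20}))    # False (exclusive bounds)
```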
110.3.1. Ranges on date fields
{
"range" : {
"date" : {
"gte" : "now-1d/d",
"lt" : "now/d"
}
}
}
Date math and rounding
When using date math to round dates to the nearest day, month, hour, etc, the rounded dates depend on whether the ends of the ranges are inclusive or exclusive.
Rounding up moves to the last millisecond of the rounding scope, and rounding down to the first millisecond of the rounding scope. For example:
| Parameter | Behaviour |
|---|---|
| gt | Greater than the date rounded up |
| gte | Greater than or equal to the date rounded down |
| lt | Less than the date rounded down |
| lte | Less than or equal to the date rounded up |
Date format in range queries
Formatted dates will be parsed using the format
specified on the date field by default, but it can be overridden by
passing the format parameter to the range query:
{
"range" : {
"born" : {
"gte": "01/01/2012",
"lte": "2013",
"format": "dd/MM/yyyy||yyyy"
}
}
}
Time zone in range queries
Dates can be converted from another timezone to UTC either by specifying the
time zone in the date value itself (if the format
accepts it), or it can be specified as the time_zone parameter:
{
"range" : {
"timestamp" : {
"gte": "2015-01-01 00:00:00",
"lte": "now",
"time_zone": "+01:00"
}
}
}
This date will be converted to 2014-12-31T23:00:00 UTC.

Note that now is not affected by the time_zone parameter (dates must be stored as UTC).
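The conversion can be checked with Python's standard datetime module (illustrating the time-zone arithmetic only):

```python
from datetime import datetime, timedelta, timezone

# "2015-01-01 00:00:00" interpreted in the +01:00 time zone
local = datetime(2015, 1, 1, 0, 0, 0, tzinfo=timezone(timedelta(hours=1)))
utc = local.astimezone(timezone.utc)
print(utc.isoformat())  # 2014-12-31T23:00:00+00:00
```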
110.4. Exists Query
Returns documents that have at least one non-null value in the original field:
{
"exists" : { "field" : "user" }
}
For instance, these documents would all match the above query:
{ "user": "jane" }
{ "user": "" }
{ "user": "-" }
{ "user": ["jane"] }
{ "user": ["jane", null ] } 
Notes on the examples above: an empty string is a non-null value; even
though the standard analyzer would emit zero tokens for "-", the original
field is non-null; and ["jane", null] matches because at least one
non-null value is present.
These documents would not match the above query:
{ "user": null }
{ "user": [] }
{ "user": [null] }
{ "foo": "bar" } 
In the first three documents the user field has no non-null values; in
the last, the user field is missing completely.
null_value mapping
If the field mapping includes the null_value setting
then explicit null values are replaced with the specified null_value. For
instance, if the user field were mapped as follows:
"user": {
"type": "string",
"null_value": "_null_"
}
then explicit null values would be indexed as the string _null_, and the
following docs would match the exists filter:
{ "user": null }
{ "user": [null] }
However, these docs—without explicit null values—would still have
no values in the user field and thus would not match the exists filter:
{ "user": [] }
{ "foo": "bar" }
110.5. Missing Query
deprecated[2.2.0, Use exists query inside a must_not clause instead]
Returns documents that have only null values or no value in the original field:
{
"constant_score" : {
"filter" : {
"missing" : { "field" : "user" }
}
}
}
For instance, the following docs would match the above filter:
{ "user": null }
{ "user": [] }
{ "user": [null] }
{ "foo": "bar" } 
In the first three documents the user field has no non-null values; in
the last, the user field is missing completely.
These documents would not match the above filter:
{ "user": "jane" }
{ "user": "" }
{ "user": "-" }
{ "user": ["jane"] }
{ "user": ["jane", null ] } 
Notes on the examples above: an empty string is a non-null value; even
though the standard analyzer would emit zero tokens for "-", the original
field is non-null; and ["jane", null] contains one non-null value.
null_value mapping
If the field mapping includes a null_value then explicit null values
are replaced with the specified null_value. For instance, if the user field were mapped
as follows:
"user": {
"type": "string",
"null_value": "_null_"
}
then explicit null values would be indexed as the string _null_, and the
following docs would not match the missing filter:
{ "user": null }
{ "user": [null] }
However, these docs—without explicit null values—would still have
no values in the user field and thus would match the missing filter:
{ "user": [] }
{ "foo": "bar" }
existence and null_value parameters
When the field being queried has a null_value mapping, then the behaviour of
the missing filter can be altered with the existence and null_value
parameters:
{
"constant_score" : {
"filter" : {
"missing" : {
"field" : "user",
"existence" : true,
"null_value" : false
}
}
}
}
existence
When the existence parameter is set to true (the default), the missing filter will include documents where the field has no values, ie:

{ "user": [] }
{ "foo": "bar" }

When set to false, these documents will not be included.

null_value
When the null_value parameter is set to true, the missing filter will include documents where the field contains a null value, ie:

{ "user": null }
{ "user": [null] }
{ "user": ["jane", null] }

The last document matches because the field contains a null value, even though it also contains a non-null value. When set to false (the default), these documents will not be included.
Either existence or null_value or both must be set to true.
110.6. Prefix Query
Matches documents that have fields containing terms with a specified
prefix (not analyzed). The prefix query maps to Lucene PrefixQuery.
The following matches documents where the user field contains a term
that starts with ki:
{
"prefix" : { "user" : "ki" }
}
A boost can also be associated with the query:
{
"prefix" : { "user" : { "value" : "ki", "boost" : 2.0 } }
}
Or:
{
"prefix" : { "user" : { "prefix" : "ki", "boost" : 2.0 } }
}
This multi term query allows you to control how it gets rewritten using the rewrite parameter.
110.7. Wildcard Query
Matches documents that have fields matching a wildcard expression (not
analyzed). Supported wildcards are *, which matches any character
sequence (including the empty one), and ?, which matches any single
character. Note this query can be slow, as it needs to iterate over many
terms. In order to prevent extremely slow wildcard queries, a wildcard
term should not start with one of the wildcards * or ?. The wildcard
query maps to Lucene WildcardQuery.
{
"wildcard" : { "user" : "ki*y" }
}
A boost can also be associated with the query:
{
"wildcard" : { "user" : { "value" : "ki*y", "boost" : 2.0 } }
}
Or:
{
"wildcard" : { "user" : { "wildcard" : "ki*y", "boost" : 2.0 } }
}
This multi term query allows you to control how it gets rewritten using the rewrite parameter.
110.8. Regexp Query
The regexp query allows you to use regular expression term queries.
See Regular expression syntax for details of the supported regular expression language.
The "term queries" in that first sentence means that Elasticsearch will apply
the regexp to the terms produced by the tokenizer for that field, and not
to the original text of the field.
Note: The performance of a regexp query heavily depends on the
regular expression chosen. Matching everything like .* is very slow as
well as using lookaround regular expressions. If possible, you should
try to use a long prefix before your regular expression starts. Wildcard
matchers like .*?+ will mostly lower performance.
{
"regexp":{
"name.first": "s.*y"
}
}
Boosting is also supported:
{
"regexp":{
"name.first":{
"value":"s.*y",
"boost":1.2
}
}
}
You can also use special flags:
{
"regexp":{
"name.first": {
"value": "s.*y",
"flags" : "INTERSECTION|COMPLEMENT|EMPTY"
}
}
}
Possible flags are ALL (default), ANYSTRING, COMPLEMENT,
EMPTY, INTERSECTION, INTERVAL, or NONE. Please check the
Lucene documentation for their meaning.
Regular expressions are dangerous because it’s easy to accidentally
create an innocuous looking one that requires an exponential number of
internal determinized automaton states (and corresponding RAM and CPU)
for Lucene to execute. Lucene prevents these using the
max_determinized_states setting (defaults to 10000). You can raise
this limit to allow more complex regular expressions to execute.
{
"regexp":{
"name.first": {
"value": "s.*y",
"flags" : "INTERSECTION|COMPLEMENT|EMPTY",
"max_determinized_states": 20000
}
}
}
110.8.1. Regular expression syntax
Regular expression queries are supported by the regexp and the query_string
queries. The Lucene regular expression engine
is not Perl-compatible but supports a smaller range of operators.
We will not attempt to explain regular expressions, but just explain the supported operators.
Standard operators
- Anchoring
-
Most regular expression engines allow you to match any part of a string. If you want the regexp pattern to start at the beginning of the string or finish at the end of the string, then you have to anchor it specifically, using ^ to indicate the beginning or $ to indicate the end.
Lucene’s patterns are always anchored. The pattern provided must match the entire string. For string "abcde":
ab.* # match
abcd # no match
- Allowed characters
-
Any Unicode characters may be used in the pattern, but certain characters are reserved and must be escaped. The standard reserved characters are:
. ? + * | { } [ ] ( ) " \
If you enable optional features (see below) then these characters may also be reserved:
# @ & < > ~
Any reserved character can be escaped with a backslash, "\*", including a literal backslash character: "\\"
Additionally, any characters (except double quotes) are interpreted literally when surrounded by double quotes:
john"@smith.com"
- Match any character
-
The period "." can be used to represent any character. For string "abcde":
ab... # match
a.c.e # match
- One-or-more
-
The plus sign "+" can be used to repeat the preceding shortest pattern once or more times. For string "aaabbb":
a+b+ # match
aa+bb+ # match
a+.+ # match
aa+bbb+ # match
- Zero-or-more
-
The asterisk "*" can be used to match the preceding shortest pattern zero-or-more times. For string "aaabbb":
a*b* # match
a*b*c* # match
.*bbb.* # match
aaa*bbb* # match
- Zero-or-one
-
The question mark "?" makes the preceding shortest pattern optional. It matches zero or one times. For string "aaabbb":
aaa?bbb? # match
aaaa?bbbb? # match
.....?.? # match
aa?bb? # no match
- Min-to-max
-
Curly brackets "{}" can be used to specify a minimum and (optionally) a maximum number of times the preceding shortest pattern can repeat. The allowed forms are:
{5} # repeat exactly 5 times
{2,5} # repeat at least twice and at most 5 times
{2,} # repeat at least twice
For string "aaabbb":
a{3}b{3} # match
a{2,4}b{2,4} # match
a{2,}b{2,} # match
.{3}.{3} # match
a{4}b{4} # no match
a{4,6}b{4,6} # no match
a{4,}b{4,} # no match
- Grouping
-
Parentheses "()" can be used to form sub-patterns. The quantity operators listed above operate on the shortest previous pattern, which can be a group. For string "ababab":
(ab)+ # match
ab(ab)+ # match
(..)+ # match
(...)+ # no match
(ab)* # match
abab(ab)? # match
ab(ab)? # no match
(ab){3} # match
(ab){1,2} # no match
- Alternation
-
The pipe symbol "|" acts as an OR operator. The match will succeed if the pattern on either the left-hand side OR the right-hand side matches. The alternation applies to the longest pattern, not the shortest. For string "aabb":
aabb|bbaa # match
aacc|bb # no match
aa(cc|bb) # match
a+|b+ # no match
a+b+|b+a+ # match
a+(b|c)+ # match
- Character classes
-
Ranges of potential characters may be represented as character classes by enclosing them in square brackets "[]". A leading ^ negates the character class. The allowed forms are:
[abc] # 'a' or 'b' or 'c'
[a-c] # 'a' or 'b' or 'c'
[-abc] # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^abc] # any character except 'a' or 'b' or 'c'
[^a-c] # any character except 'a' or 'b' or 'c'
[^-abc] # any character except '-' or 'a' or 'b' or 'c'
[^abc\-] # any character except '-' or 'a' or 'b' or 'c'
Note that the dash "-" indicates a range of characters, unless it is the first character or if it is escaped with a backslash.
For string "abcd":
ab[cd]+ # match
[a-d]+ # match
[^a-d]+ # no match
Optional operators
These operators are available by default as the flags parameter defaults to ALL.
Different flag combinations (concatenated with "|") can be used to enable/disable
specific operators:
{
"regexp": {
"username": {
"value": "john~athon<1-5>",
"flags": "COMPLEMENT|INTERVAL"
}
}
}
- Complement
-
The complement is probably the most useful option. The shortest pattern that follows a tilde "~" is negated. For instance, "ab~cd" means:
Starts with
a -
Followed by
b -
Followed by a string of any length that is anything but
c -
Ends with
d
For the string "abcdef":
ab~df # match
ab~cf # match
ab~cdef # no match
a~(cb)def # match
a~(bc)def # no match
Enabled with the COMPLEMENT or ALL flags.
- Interval
-
The interval option enables the use of numeric ranges, enclosed by angle brackets "<>". For string "foo80":
foo<1-100> # match
foo<01-100> # match
foo<001-100> # no match
Enabled with the INTERVAL or ALL flags.
- Intersection
-
The ampersand "&" joins two patterns in a way that both of them have to match. For string "aaabbb":
aaa.+&.+bbb # match
aaa&bbb # no match
Using this feature usually means that you should rewrite your regular expression.
Enabled with the INTERSECTION or ALL flags.
- Any string
-
The at sign "@" matches any string in its entirety. This could be combined with the intersection and complement above to express “everything except”. For instance:
@&~(foo.+) # anything except string beginning with "foo"
Enabled with the ANYSTRING or ALL flags.
110.9. Fuzzy Query
The fuzzy query uses similarity based on Levenshtein edit distance for
string fields, and a +/- margin on numeric and date fields.
110.9.1. String fields
The fuzzy query generates all possible matching terms that are within the
maximum edit distance specified in fuzziness and then checks the term
dictionary to find out which of those generated terms actually exist in the
index.
Here is a simple example:
{
"fuzzy" : { "user" : "ki" }
}
Or with more advanced settings:
{
"fuzzy" : {
"user" : {
"value" : "ki",
"boost" : 1.0,
"fuzziness" : 2,
"prefix_length" : 0,
"max_expansions": 100
}
}
}
Parameters
fuzziness
The maximum edit distance. Defaults to AUTO. See Fuzziness for allowed settings.
prefix_length
The number of initial characters which will not be “fuzzified”. This
helps to reduce the number of terms which must be examined. Defaults
to 0.
max_expansions
The maximum number of terms that the fuzzy query will expand to.
Defaults to 50.
Warning: this query can be very heavy if prefix_length is set to 0 and if
max_expansions is set to a high number. It could result in every term in the
index being examined!
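The edit-distance idea behind term generation can be illustrated with a plain Levenshtein implementation. This is a sketch for intuition only: the real query enumerates candidate terms with a Levenshtein automaton over the term dictionary rather than comparing the query against every term.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance:
    # insertions, deletions and substitutions each cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# With "fuzziness": 2, the query term "ki" would match indexed terms
# within edit distance 2 (hypothetical term dictionary):
terms = ["ki", "kim", "kimchy", "kid"]
matches = [t for t in terms if levenshtein("ki", t) <= 2]
print(matches)  # ['ki', 'kim', 'kid']
```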
110.9.2. Numeric and date fields
Performs a Range Query “around” the value using the
fuzziness value as a +/- range, where:
-fuzziness <= field value <= +fuzziness
For example:
{
"fuzzy" : {
"price" : {
"value" : 12,
"fuzziness" : 2
}
}
}
Will result in a range query between 10 and 14. Date fields support time values, e.g.:
{
"fuzzy" : {
"created" : {
"value" : "2010-02-05T12:05:07",
"fuzziness" : "1d"
}
}
}
See Fuzziness for more details about accepted values.
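For numeric fields, the behaviour above amounts to a simple range computation. A minimal sketch (the function name is illustrative, not an Elasticsearch API):

```python
def fuzzy_numeric_range(value: float, fuzziness: float) -> tuple:
    # value - fuzziness <= field value <= value + fuzziness
    return (value - fuzziness, value + fuzziness)

print(fuzzy_numeric_range(12, 2))  # (10, 14), matching the example above
```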
110.10. Type Query
Filters documents matching the provided document / mapping type.
{
"type" : {
"value" : "my_type"
}
}
110.11. Ids Query
Filters documents that only have the provided ids. Note, this query uses the _uid field.
{
"ids" : {
"type" : "my_type",
"values" : ["1", "4", "100"]
}
}
The type is optional and can be omitted, and can also accept an array
of values. If no type is specified, all types defined in the index mapping are tried.
111. Compound queries
Compound queries wrap other compound or leaf queries, either to combine their results and scores, to change their behaviour, or to switch from query to filter context.
The queries in this group are:
constant_score query
A query which wraps another query, but executes it in filter context. All matching documents are given the same “constant” _score.
bool query
The default query for combining multiple leaf or compound query clauses, as must, should, must_not, or filter clauses. The must and should clauses have their scores combined — the more matching clauses, the better — while the must_not and filter clauses are executed in filter context.
dis_max query
A query which accepts multiple queries, and returns any documents which match any of the query clauses. While the bool query combines the scores from all matching queries, the dis_max query uses the score of the single best-matching query clause.
function_score query
Modify the scores returned by the main query with functions to take into account factors like popularity, recency, distance, or custom algorithms implemented with scripting.
boosting query
Return documents which match a positive query, but reduce the score of documents which also match a negative query.
indices query
Execute one query for the specified indices, and another for other indices.
and, or, not
Synonyms for the bool query.
filtered query
Combine a query clause in query context with another in filter context. deprecated[2.0.0-beta1, Use the bool query instead]
limit query
Limits the number of documents examined per shard.
111.1. Constant Score Query
A query that wraps another query and simply returns a
constant score equal to the query boost for every document in the
filter. Maps to Lucene ConstantScoreQuery.
{
"constant_score" : {
"filter" : {
"term" : { "user" : "kimchy"}
},
"boost" : 1.2
}
}
111.2. Bool Query
A query that matches documents matching boolean combinations of other
queries. The bool query maps to Lucene BooleanQuery. It is built using
one or more boolean clauses, each clause with a typed occurrence. The
occurrence types are:
| Occur | Description |
|---|---|
| must | The clause (query) must appear in matching documents and will contribute to the score. |
| filter | The clause (query) must appear in matching documents. However, unlike must, the score of the clause will be ignored. |
| should | The clause (query) should appear in the matching document. In a boolean query with no must or filter clauses, one or more should clauses must match a document. The minimum number of should clauses to match can be set using the minimum_should_match parameter. |
| must_not | The clause (query) must not appear in the matching documents. |
Bool query in filter context: if this query is used in a filter context and it has should clauses then at least one should clause is required to match.
The bool query also supports a disable_coord parameter (defaults to
false). Basically the coord similarity computes a score factor based
on the fraction of all query terms that a document contains. See Lucene
BooleanQuery for more details.
The bool query takes a more-matches-is-better approach, so the score from
each matching must or should clause will be added together to provide the
final _score for each document.
{
"bool" : {
"must" : {
"term" : { "user" : "kimchy" }
},
"filter": {
"term" : { "tag" : "tech" }
},
"must_not" : {
"range" : {
"age" : { "from" : 10, "to" : 20 }
}
},
"should" : [
{
"term" : { "tag" : "wow" }
},
{
"term" : { "tag" : "elasticsearch" }
}
],
"minimum_should_match" : 1,
"boost" : 1.0
}
}
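The more-matches-is-better behaviour can be sketched as follows. This is illustrative only: real Lucene scoring also involves term statistics, coord, and query normalization.

```python
def bool_score(must_scores, should_scores, minimum_should_match=0):
    """Toy bool scorer: a document matches if every must clause matched
    and at least minimum_should_match should clauses matched; the final
    score is the sum of the matching scoring clauses. filter and
    must_not clauses run in filter context and contribute no score.
    None represents a clause that did not match."""
    if any(s is None for s in must_scores):
        return None  # a must clause failed: the document does not match
    matched_should = [s for s in should_scores if s is not None]
    if len(matched_should) < minimum_should_match:
        return None
    return sum(must_scores) + sum(matched_should)

# must matched with 1.2; one of two should clauses matched with 0.4:
print(round(bool_score([1.2], [0.4, None], minimum_should_match=1), 2))  # 1.6
```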
111.2.1. Scoring with bool.filter
Queries specified under the filter element have no effect on scoring — scores are returned as 0. Scores are only affected by the query that has
been specified. For instance, all three of the following queries return
all documents where the status field contains the term active.
This first query assigns a score of 0 to all documents, as no scoring
query has been specified:
GET _search
{
"query": {
"bool": {
"filter": {
"term": {
"status": "active"
}
}
}
}
}
This bool query has a match_all query, which assigns a score of 1.0 to
all documents.
GET _search
{
"query": {
"bool": {
"query": {
"match_all": {}
},
"filter": {
"term": {
"status": "active"
}
}
}
}
}
This constant_score query behaves in exactly the same way as the second example above.
The constant_score query assigns a score of 1.0 to all documents matched
by the filter.
GET _search
{
"query": {
"constant_score": {
"filter": {
"term": {
"status": "active"
}
}
}
}
}
111.2.2. Using named queries to see which clauses matched
If you need to know which of the clauses in the bool query matched the documents returned from the query, you can use named queries to assign a name to each clause.
111.3. Dis Max Query
A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.
This is useful when searching for a word in multiple fields with different boost factors (so that the fields cannot be combined equivalently into a single search field). We want the primary score to be the one associated with the highest boost, not the sum of the field scores (as Boolean Query would give). If the query is "albino elephant" this ensures that "albino" matching one field and "elephant" matching another gets a higher score than "albino" matching both fields. To get this result, use both Boolean Query and DisjunctionMax Query: for each term a DisjunctionMaxQuery searches for it in each field, while the set of these DisjunctionMaxQuery’s is combined into a BooleanQuery.
The tie breaker capability allows results that include the same term in
multiple fields to be judged better than results that include this term
in only the best of those multiple fields, without confusing this with
the better case of two different terms in the multiple fields. The
default tie_breaker is 0.0.
This query maps to Lucene DisjunctionMaxQuery.
{
"dis_max" : {
"tie_breaker" : 0.7,
"boost" : 1.2,
"queries" : [
{
"term" : { "age" : 34 }
},
{
"term" : { "age" : 35 }
}
]
}
}
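The dis_max scoring rule can be sketched as: take the best subquery score, then add tie_breaker times each of the other matching subquery scores. An illustrative pseudo-implementation, not the Lucene code:

```python
def dis_max_score(subquery_scores, tie_breaker=0.0):
    # Score = best score + tie_breaker * (sum of the remaining matching scores).
    # None represents a subquery that did not match the document.
    scores = sorted((s for s in subquery_scores if s is not None), reverse=True)
    if not scores:
        return None  # no subquery matched
    return scores[0] + tie_breaker * sum(scores[1:])

# Two term clauses matched with scores 2.0 and 1.0, tie_breaker 0.7:
print(dis_max_score([2.0, 1.0], tie_breaker=0.7))  # 2.7
```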
111.4. Function Score Query
The function_score allows you to modify the score of documents that are
retrieved by a query. This can be useful if, for example, a score
function is computationally expensive and it is sufficient to compute
the score on a filtered set of documents.
To use function_score, the user has to define a query and one or
more functions, that compute a new score for each document returned
by the query.
function_score can be used with only one function like this:
"function_score": {
"query": {},
"boost": "boost for the whole query",
"FUNCTION": {},
"boost_mode":"(multiply|replace|...)"
}
| See [score-functions] for a list of supported functions. |
Furthermore, several functions can be combined. In this case one can optionally choose to apply the function only if a document matches a given filtering query
"function_score": {
"query": {},
"boost": "boost for the whole query",
"functions": [
{
"filter": {},
"FUNCTION": {},
"weight": number
},
{
"FUNCTION": {}
},
{
"filter": {},
"weight": number
}
],
"max_boost": number,
"score_mode": "(multiply|max|...)",
"boost_mode": "(multiply|replace|...)",
"min_score" : number
}
| See [score-functions] for a list of supported functions. |
The scores produced by the filtering query of each function do not matter.
If no query is given with a function this is equivalent to specifying
"match_all": {}
First, each document is scored by the defined functions. The parameter
score_mode specifies how the computed scores are combined:
| Mode | Description |
|---|---|
| multiply | scores are multiplied (default) |
| sum | scores are summed |
| avg | scores are averaged |
| first | the first function that has a matching filter is applied |
| max | maximum score is used |
| min | minimum score is used |
Because scores can be on different scales (for example, between 0 and 1 for decay functions but arbitrary for field_value_factor) and also because sometimes a different impact of functions on the score is desirable, the score of each function can be adjusted with a user defined weight. The weight can be defined per function in the functions array (example above) and is multiplied with the score computed by the respective function.
If weight is given without any other function declaration, weight acts as a function that simply returns the weight.
The new score can be restricted to not exceed a certain limit by setting
the max_boost parameter. The default for max_boost is FLT_MAX.
The newly computed score is combined with the score of the
query. The parameter boost_mode defines how:
| Mode | Description |
|---|---|
| multiply | query score and function score are multiplied (default) |
| replace | only function score is used, the query score is ignored |
| sum | query score and function score are added |
| avg | average of query score and function score |
| max | max of query score and function score |
| min | min of query score and function score |
By default, modifying the score does not change which documents match. To exclude
documents that do not meet a certain score threshold the min_score parameter can be set to the desired score threshold.
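How score_mode, boost_mode, max_boost, and min_score interact can be sketched like this. The sketch is illustrative only; the function name is made up and the exact evaluation order inside Lucene differs.

```python
from functools import reduce

def function_score(query_score, function_scores,
                   score_mode="multiply", boost_mode="multiply",
                   max_boost=float("inf"), min_score=None):
    # 1. Combine the individual function scores according to score_mode.
    combine = {
        "multiply": lambda xs: reduce(lambda a, b: a * b, xs),
        "sum": sum,
        "avg": lambda xs: sum(xs) / len(xs),
        "max": max,
        "min": min,
        "first": lambda xs: xs[0],
    }[score_mode]
    func_score = min(combine(function_scores), max_boost)  # cap with max_boost
    # 2. Merge with the query score according to boost_mode.
    merged = {
        "multiply": query_score * func_score,
        "replace": func_score,
        "sum": query_score + func_score,
        "avg": (query_score + func_score) / 2,
        "max": max(query_score, func_score),
        "min": min(query_score, func_score),
    }[boost_mode]
    # 3. min_score excludes documents whose final score falls below it.
    return merged if min_score is None or merged >= min_score else None

print(function_score(2.0, [0.5, 3.0], score_mode="sum", boost_mode="multiply"))  # 7.0
```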
The function_score query provides several types of score functions:
-
script_score
-
weight
-
random_score
-
field_value_factor
-
decay functions: gauss, linear, exp
111.4.1. Script score
The script_score function allows you to wrap another query and customize
the scoring of it optionally with a computation derived from other numeric
field values in the doc using a script expression. Here is a
simple sample:
"script_score" : {
"script" : "_score * doc['my_numeric_field'].value"
}
On top of the different scripting field values and expression, the
_score script parameter can be used to retrieve the score based on the
wrapped query.
Scripts are cached for faster execution. If the script has parameters that it needs to take into account, it is preferable to reuse the same script, and provide parameters to it:
"script_score": {
"script": {
"lang": "lang",
"params": {
"param1": value1,
"param2": value2
},
"inline": "_score * doc['my_numeric_field'].value / pow(param1, param2)"
}
}
Note that unlike the custom_score query, the
score of the query is multiplied with the result of the script scoring. If
you wish to inhibit this, set "boost_mode": "replace"
111.4.2. Weight
The weight score allows you to multiply the score by the provided
weight. This can sometimes be desired since boost value set on
specific queries gets normalized, while for this score function it does
not.
"weight" : number
111.4.3. Random
The random_score generates scores using a hash of the _uid field,
with a seed for variation. If seed is not specified, the current
time is used.
Note: using this feature will load field data for _uid, which can
be a memory intensive operation since the values are unique.
"random_score": {
"seed" : number
}
111.4.4. Field Value factor
The field_value_factor function allows you to use a field from a document to
influence the score. It’s similar to using the script_score function, however,
it avoids the overhead of scripting. If used on a multi-valued field, only the
first value of the field is used in calculations.
As an example, imagine you have a document indexed with a numeric popularity
field and wish to influence the score of a document with this field. An example
doing so would look like:
"field_value_factor": {
"field": "popularity",
"factor": 1.2,
"modifier": "sqrt",
"missing": 1
}
Which will translate into the following formula for scoring:
sqrt(1.2 * doc['popularity'].value)
There are a number of options for the field_value_factor function:
field
Field to be extracted from the document.
factor
Optional factor to multiply the field value with, defaults to 1.
modifier
Modifier to apply to the field value, can be one of the values in the table below. Defaults to none.
| Modifier | Meaning |
|---|---|
| none | Do not apply any multiplier to the field value |
| log | Take the logarithm of the field value |
| log1p | Add 1 to the field value and take the logarithm |
| log2p | Add 2 to the field value and take the logarithm |
| ln | Take the natural logarithm of the field value |
| ln1p | Add 1 to the field value and take the natural logarithm |
| ln2p | Add 2 to the field value and take the natural logarithm |
| square | Square the field value (multiply it by itself) |
| sqrt | Take the square root of the field value |
| reciprocal | Reciprocate the field value, same as 1/x |
missing
Value used if the document doesn’t have that field. The modifier and factor are still applied to it as though it were read from the document.
Keep in mind that taking the log() of 0, or the square root of a negative number, is an illegal operation, and an exception will be thrown. Be sure to limit the values of the field with a range filter to avoid this, or use log1p and ln1p.
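The formula above can be reproduced directly. A sketch of the modifier arithmetic, assuming log denotes the common (base-10) logarithm; the dictionary of modifiers mirrors the table above:

```python
import math

MODIFIERS = {
    "none": lambda x: x,
    "log": math.log10,
    "log1p": lambda x: math.log10(x + 1),
    "log2p": lambda x: math.log10(x + 2),
    "ln": math.log,
    "ln1p": math.log1p,
    "ln2p": lambda x: math.log(x + 2),
    "square": lambda x: x * x,
    "sqrt": math.sqrt,
    "reciprocal": lambda x: 1.0 / x,
}

def field_value_factor(value, factor=1.0, modifier="none", missing=None):
    if value is None:
        value = missing  # fall back to the configured missing value
    return MODIFIERS[modifier](factor * value)

# sqrt(1.2 * popularity) for a document with popularity 30:
print(field_value_factor(30, factor=1.2, modifier="sqrt"))  # 6.0
```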
111.4.5. Decay functions
Decay functions score a document with a function that decays depending on the distance of a numeric field value of the document from a user given origin. This is similar to a range query, but with smooth edges instead of boxes.
To use distance scoring on a query that has numerical fields, the user
has to define an origin and a scale for each field. The origin
is needed to define the “central point” from which the distance
is calculated, and the scale to define the rate of decay. The
decay function is specified as
"DECAY_FUNCTION": {
"FIELD_NAME": {
"origin": "11, 12",
"scale": "2km",
"offset": "0km",
"decay": 0.33
}
}
The DECAY_FUNCTION should be one of linear, exp, or gauss.
The specified field must be a numeric, date, or geo-point field.
In the above example, the field is a geo_point and origin can be provided in geo format. scale and offset must be given with a unit in this case. If your field is a date field, you can set scale and offset as days, weeks, and so on. Example:
"gauss": {
"date": {
"origin": "2013-09-17",
"scale": "10d",
"offset": "5d",
"decay" : 0.5
}
}
The date format of the origin depends on the format defined in
your mapping. If you do not define the origin, the current time is used.
The offset and decay parameters are optional.
origin
The point of origin used for calculating distance. Must be given as a
number for numeric fields, a date for date fields and a geo point for geo fields.
Required for geo and numeric fields. For date fields the default is now.
scale
Required for all types. Defines the distance from origin at which the computed
score will equal the decay parameter.
offset
If an offset is defined, the decay function will only compute the decay for
documents with a distance greater than the defined offset. The default is 0.
decay
The decay parameter defines how documents are scored at the distance given at
scale. If no decay is defined, documents at the distance scale will be scored 0.5.
In the first example, your documents might represent hotels and contain a geo location field. You want to compute a decay function depending on how far the hotel is from a given location. You might not immediately see what scale to choose for the gauss function, but you can say something like: "At a distance of 2km from the desired location, the score should be reduced to one third." The parameter "scale" will then be adjusted automatically to assure that the score function computes a score of 0.33 for hotels that are 2km away from the desired location.
In the second example, documents with a field value between 2013-09-12 and 2013-09-22 would get a weight of 1.0 and documents which are 15 days from that date a weight of 0.5.
Supported decay functions
The DECAY_FUNCTION determines the shape of the decay:
gauss
Normal decay, computed as:
S(doc) = exp( -(max(0, |fieldvalue - origin| - offset))^2 / (2 * sigma^2) )
where sigma is computed to assure that the score takes the value decay at distance scale from origin+-offset:
sigma^2 = -scale^2 / (2 * ln(decay))
See Normal decay, keyword gauss for graphs demonstrating the curve generated by the gauss function.
exp
Exponential decay, computed as:
S(doc) = exp( lambda * max(0, |fieldvalue - origin| - offset) )
where again the parameter lambda is computed to assure that the score takes the value decay at distance scale from origin+-offset:
lambda = ln(decay) / scale
See Exponential decay, keyword exp for graphs demonstrating the curve generated by the exp function.
linear
Linear decay, computed as:
S(doc) = max( (s - max(0, |fieldvalue - origin| - offset)) / s, 0 )
where again the parameter s is computed to assure that the score takes the value decay at distance scale from origin+-offset:
s = scale / (1.0 - decay)
In contrast to the normal and exponential decay, this function actually sets the score to 0 if the field value exceeds twice the user given scale value.
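The three decay curves can be computed directly from origin, scale, offset, and decay. A sketch of the formulas for numeric fields; geo and date fields differ only in how the distance is measured:

```python
import math

def decay_score(kind, value, origin, scale, offset=0.0, decay=0.5):
    # Distance beyond the offset "plateau" around the origin:
    dist = max(0.0, abs(value - origin) - offset)
    if kind == "gauss":
        # sigma chosen so the score equals `decay` at distance `scale`
        sigma2 = -scale ** 2 / (2.0 * math.log(decay))
        return math.exp(-dist ** 2 / (2.0 * sigma2))
    if kind == "exp":
        lam = math.log(decay) / scale  # negative, so the score decays
        return math.exp(lam * dist)
    if kind == "linear":
        s = scale / (1.0 - decay)
        return max((s - dist) / s, 0.0)  # hits exactly 0 past 2*scale (decay=0.5)
    raise ValueError(kind)

# All three score exactly `decay` (0.5) at distance `scale` from the origin:
for kind in ("gauss", "exp", "linear"):
    print(kind, round(decay_score(kind, value=30, origin=10, scale=20), 3))
```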
For single functions, the three decay functions together with their parameters can be visualized as decay curves over the field value (the field in this example called "age").
Multi-values fields
If a field used for computing the decay contains multiple values, by default the value closest to the origin is chosen for determining the distance.
This can be changed by setting multi_value_mode.
| Mode | Description |
|---|---|
| min | Distance is the minimum distance |
| max | Distance is the maximum distance |
| avg | Distance is the average distance |
| sum | Distance is the sum of all distances |
Example:
"DECAY_FUNCTION": {
"FIELD_NAME": {
"origin": ...,
"scale": ...
},
"multi_value_mode": "avg"
}
111.4.6. Detailed example
Suppose you are searching for a hotel in a certain town. Your budget is limited. Also, you would like the hotel to be close to the town center, so the farther the hotel is from the desired location the less likely you are to check in.
You would like the query results that match your criterion (for example, "hotel, Nancy, non-smoker") to be scored with respect to distance to the town center and also the price.
Intuitively, you would like to define the town center as the origin and
maybe you are willing to walk 2km to the town center from the hotel.
In this case your origin for the location field is the town center
and the scale is ~2km.
If your budget is low, you would probably prefer something cheap above something expensive. For the price field, the origin would be 0 Euros and the scale depends on how much you are willing to pay, for example 20 Euros.
In this example, the fields might be called "price" for the price of the hotel and "location" for the coordinates of this hotel.
The function for price in this case would be
"gauss": {
"price": {
"origin": "0",
"scale": "20"
}
}
This decay function could also be linear or exp.
and for location:
"gauss": {
"location": {
"origin": "11, 12",
"scale": "2km"
}
}
This decay function could also be linear or exp.
Suppose you want to multiply these two functions on the original score, the request would look like this:
GET /hotels/_search/
{
"query": {
"function_score": {
"functions": [
{
"gauss": {
"price": {
"origin": "0",
"scale": "20"
}
}
},
{
"gauss": {
"location": {
"origin": "11, 12",
"scale": "2km"
}
}
}
],
"query": {
"match": {
"properties": "balcony"
}
},
"score_mode": "multiply"
}
}
}
Next, we show what the computed score looks like for each of the three possible decay functions.
Normal decay, keyword gauss
When choosing gauss as the decay function in the above example, the
multiplier decays smoothly in both price and distance around the chosen origin.
Suppose your original search matches three hotels:
-
"Backback Nap"
-
"Drink n Drive"
-
"BnB Bellevue".
"Drink n Drive" is pretty far from your defined location (nearly 2 km) and is not too cheap (about 13 Euros) so it gets a low factor of 0.56. "BnB Bellevue" and "Backback Nap" are both pretty close to the defined location but "BnB Bellevue" is cheaper, so it gets a multiplier of 0.86 whereas "Backpack Nap" gets a value of 0.66.
111.5. Boosting Query
The boosting query can be used to effectively demote results that
match a given query. Unlike the "NOT" clause in bool query, this still
selects documents that contain undesirable terms, but reduces their
overall score.
{
"boosting" : {
"positive" : {
"term" : {
"field1" : "value1"
}
},
"negative" : {
"term" : {
"field2" : "value2"
}
},
"negative_boost" : 0.2
}
}
111.6. Indices Query
The indices query is useful in cases where a search is executed across
multiple indices. It allows you to specify a list of index names and an inner
query that is only executed for indices matching names on that list.
For other indices that are searched, but that don’t match entries
on the list, the alternative no_match_query is executed.
{
"indices" : {
"indices" : ["index1", "index2"],
"query" : {
"term" : { "tag" : "wow" }
},
"no_match_query" : {
"term" : { "tag" : "kow" }
}
}
}
You can use the index field to provide a single index.
no_match_query can also have "string" value of none (to match no
documents), and all (to match all). Defaults to all.
query is mandatory, as well as indices (or index).
The fields order is important: if the |
111.7. And Query
deprecated[2.0.0-beta1, Use the bool query instead]
A query that matches documents using the AND boolean operator on other
queries.
{
"filtered" : {
"query" : {
"term" : { "name.first" : "shay" }
},
"filter" : {
"and" : [
{
"range" : {
"postDate" : {
"from" : "2010-03-01",
"to" : "2010-04-01"
}
}
},
{
"prefix" : { "name.second" : "ba" }
}
]
}
}
}
111.8. Not Query
deprecated[2.1.0, Use the bool query with must_not clause instead]
A query that filters out matched documents using a query. For example:
{
"bool" : {
"must" : {
"term" : { "name.first" : "shay" }
},
"filter" : {
"not" : {
"range" : {
"postDate" : {
"from" : "2010-03-01",
"to" : "2010-04-01"
}
}
}
}
}
}
Or, in a longer form with a filter element:
{
"bool" : {
"must" : {
"term" : { "name.first" : "shay" }
},
"filter" : {
"not" : {
"filter" : {
"range" : {
"postDate" : {
"from" : "2010-03-01",
"to" : "2010-04-01"
}
}
}
}
}
}
}
111.9. Or Query
deprecated[2.0.0-beta1, Use the bool query instead]
A query that matches documents using the OR boolean operator on other
queries.
{
"filtered" : {
"query" : {
"term" : { "name.first" : "shay" }
},
"filter" : {
"or" : [
{
"term" : { "name.second" : "banon" }
},
{
"term" : { "name.nick" : "kimchy" }
}
]
}
}
}
111.10. Filtered Query
deprecated[2.0.0-beta1, Use the bool query instead with a must clause for the query and a filter clause for the filter]
The filtered query is used to combine a query which will be used for
scoring with another query which will only be used for filtering the result
set.
Tip: exclude as many documents as you can with a filter, then query just the documents that remain.
{
"filtered": {
"query": {
"match": { "tweet": "full text search" }
},
"filter": {
"range": { "created": { "gte": "now-1d/d" }}
}
}
}
The filtered query can be used wherever a query is expected, for instance,
to use the above example in a search request:
curl -XGET localhost:9200/_search -d '
{
"query": {
"filtered": {
"query": {
"match": { "tweet": "full text search" }
},
"filter": {
"range": { "created": { "gte": "now-1d/d" }}
}
}
}
}
'
The filtered query is passed as the value of the query
parameter in the search request.
111.10.1. Filtering without a query
If a query is not specified, it defaults to the
match_all query. This means that the
filtered query can be used to wrap just a filter, so that it can be used
wherever a query is expected.
curl -XGET localhost:9200/_search -d '
{
"query": {
"filtered": {
"filter": {
"range": { "created": { "gte": "now-1d/d" }}
}
}
}
}
'
No query has been specified, so this request applies just the filter,
returning all documents created since yesterday.
Multiple filters
Multiple filters can be applied by wrapping them in a
bool query, for example:
{
"filtered": {
"query": { "match": { "tweet": "full text search" }},
"filter": {
"bool": {
"must": { "range": { "created": { "gte": "now-1d/d" }}},
"should": [
{ "term": { "featured": true }},
{ "term": { "starred": true }}
],
"must_not": { "term": { "deleted": false }}
}
}
}
}
112. Joining queries
Performing full SQL-style joins in a distributed system like Elasticsearch is prohibitively expensive. Instead, Elasticsearch offers two forms of join which are designed to scale horizontally.
nested query
Documents may contain fields of type nested. These fields are used to index arrays of objects, where each object can be queried (with the nested query) as an independent document.
has_child and has_parent queries
A parent-child relationship can exist between two document types within a single index. The has_child query returns parent documents whose child documents match the specified query, while the has_parent query returns child documents whose parent document matches the specified query.
Also see the terms-lookup mechanism in the terms
query, which allows you to build a terms query from values contained in
another document.
112.1. Nested Query
The nested query allows you to query nested objects / docs (see nested mapping). The query is executed against the nested objects / docs as if they were indexed as separate docs (internally, they are) and results in the root parent doc (or parent nested mapping). Here is a sample mapping we will work with:
{
"type1" : {
"properties" : {
"obj1" : {
"type" : "nested"
}
}
}
}
And here is a sample nested query usage:
{
"nested" : {
"path" : "obj1",
"score_mode" : "avg",
"query" : {
"bool" : {
"must" : [
{
"match" : {"obj1.name" : "blue"}
},
{
"range" : {"obj1.count" : {"gt" : 5}}
}
]
}
}
}
}
The query path points to the nested object path, and the query
includes the query that will run on the nested docs matching
the direct path, and joining with the root parent docs. Note that any
fields referenced inside the query must use the complete path (fully
qualified).
The score_mode parameter sets how matching inner children affect the
scoring of the parent. It defaults to avg, but can be set to sum, min,
max or none.
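As an illustration of these modes, here is a sketch of how the matching children's scores could be folded into the parent score (a simplified model, not the actual Lucene implementation):

```python
# Simplified model of score_mode: combine the scores of the matching
# child (nested) docs into a single score for the root parent doc.
def combine_child_scores(child_scores, score_mode="avg"):
    if score_mode == "none" or not child_scores:
        return 0.0
    if score_mode == "avg":
        return sum(child_scores) / len(child_scores)
    if score_mode == "sum":
        return sum(child_scores)
    if score_mode == "min":
        return min(child_scores)
    if score_mode == "max":
        return max(child_scores)
    raise ValueError("unknown score_mode: %s" % score_mode)
```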
Multi level nesting is automatically supported, and detected, resulting in an inner nested query to automatically match the relevant nesting level (and not root) if it exists within another nested query.
112.2. Has Child Query
The has_child query accepts a query and the child type to run against, and
results in parent documents that have child docs matching the query. Here is
an example:
{
"has_child" : {
"type" : "blog_tag",
"query" : {
"term" : {
"tag" : "something"
}
}
}
}
Scoring capabilities
The has_child query also has scoring support. The
supported score modes are min, max, sum, avg or none. The default is
none and yields the same behaviour as in previous versions. If the
score mode is set to a value other than none, the scores of all the
matching child documents are aggregated into the associated parent
documents. The score mode can be specified with the score_mode field
inside the has_child query:
{
"has_child" : {
"type" : "blog_tag",
"score_mode" : "sum",
"query" : {
"term" : {
"tag" : "something"
}
}
}
}
Min/Max Children
The has_child query allows you to specify that a minimum and/or maximum
number of children are required to match for the parent doc to be considered
a match:
{
"has_child" : {
"type" : "blog_tag",
"score_mode" : "sum",
"min_children": 2,
"max_children": 10,
"query" : {
"term" : {
"tag" : "something"
}
}
}
}
Both min_children and max_children are optional.
The min_children and max_children parameters can be combined with
the score_mode parameter.
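The constraint itself is simple to model. A sketch (the default values here are assumptions for illustration, not taken from the reference):

```python
# A parent doc is a match only when the number of its matching child
# docs falls within [min_children, max_children].
def parent_is_match(matching_children, min_children=1, max_children=None):
    if matching_children < min_children:
        return False
    if max_children is not None and matching_children > max_children:
        return False
    return True
```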
112.3. Has Parent Query
The has_parent query accepts a query and a parent type. The query is
executed in the parent document space, which is specified by the parent
type. This query returns child documents whose associated parents have
matched. Otherwise, the has_parent query has the same options and works
in the same manner as the has_child query.
{
"has_parent" : {
"parent_type" : "blog",
"query" : {
"term" : {
"tag" : "something"
}
}
}
}
Scoring capabilities
The has_parent query also has scoring support. The
supported score modes are score or none. The default is none, which
ignores the score from the parent document. In this case the score is
equal to the boost on the has_parent query (defaults to 1). If
the score mode is set to score, then the score of the matching parent
document is aggregated into the child documents belonging to the
matching parent document. The score mode can be specified with the
score_mode field inside the has_parent query:
{
"has_parent" : {
"parent_type" : "blog",
"score_mode" : "score",
"query" : {
"term" : {
"tag" : "something"
}
}
}
}
113. Geo queries
Elasticsearch supports two types of geo data:
geo_point fields which support lat/lon pairs, and
geo_shape fields, which support points,
lines, circles, polygons, multi-polygons etc.
The queries in this group are:
geo_shape query
Finds documents with geo-shapes which either intersect, are contained by, or do not intersect with the specified geo-shape.
geo_bounding_box query
Finds documents with geo-points that fall into the specified rectangle.
geo_distance query
Finds documents with geo-points within the specified distance of a central point.
geo_distance_range query
Like the geo_distance query, but the range starts at a specified distance from the central point.
geo_polygon query
Finds documents with geo-points within the specified polygon.
geohash_cell query
Finds geo-points whose geohash intersects with the geohash of the specified point.
Percolating geo-queries in Elasticsearch 2.2.0 or later
See Percolating geo-queries in Elasticsearch 2.2.0 and later for a workaround.
113.1. GeoShape Query
Filter documents indexed using the geo_shape type.
Requires the geo_shape Mapping.
The geo_shape query uses the same grid square representation as the
geo_shape mapping to find documents that have a shape that intersects
with the query shape. It will also use the same PrefixTree configuration
as defined for the field mapping.
The query supports two ways of defining the query shape, either by providing a whole shape definition, or by referencing the name of a shape pre-indexed in another index. Both formats are defined below with examples.
113.1.1. Inline Shape Definition
Similar to the geo_shape type, the geo_shape query uses
GeoJSON to represent shapes.
Given a document that looks like this:
{
"name": "Wind & Wetter, Berlin, Germany",
"location": {
"type": "Point",
"coordinates": [13.400544, 52.530286]
}
}
The following query will find the point using Elasticsearch's
envelope GeoJSON extension:
{
"query":{
"bool": {
"must": {
"match_all": {}
},
"filter": {
"geo_shape": {
"location": {
"shape": {
"type": "envelope",
"coordinates" : [[13.0, 53.0], [14.0, 52.0]]
},
"relation": "within"
}
}
}
}
}
}
113.1.2. Pre-Indexed Shape
The query also supports using a shape which has already been indexed in another index and/or index type. This is particularly useful when you have a pre-defined list of shapes which are useful to your application and you want to reference them using a logical name (for example New Zealand) rather than having to provide their coordinates each time. In this situation it is only necessary to provide:
- id - The ID of the document that contains the pre-indexed shape.
- index - Name of the index where the pre-indexed shape is. Defaults to shapes.
- type - Index type where the pre-indexed shape is.
- path - The field specified as path containing the pre-indexed shape. Defaults to shape.
The following is an example of using the query with a pre-indexed shape:
{
"bool": {
"must": {
"match_all": {}
},
"filter": {
"geo_shape": {
"location": {
"indexed_shape": {
"id": "DEU",
"type": "countries",
"index": "shapes",
"path": "location"
}
}
}
}
}
}
113.1.3. Spatial Relations
The geo_shape strategy mapping parameter determines which spatial relation operators may be used at search time.
The following is a complete list of spatial relation operators available:
- INTERSECTS - (default) Return all documents whose geo_shape field intersects the query geometry.
- DISJOINT - Return all documents whose geo_shape field has nothing in common with the query geometry.
- WITHIN - Return all documents whose geo_shape field is within the query geometry.
- CONTAINS - Return all documents whose geo_shape field contains the query geometry.
113.2. Geo Bounding Box Query
A query allowing you to filter hits based on a point location, using a bounding box. Assuming the following indexed document:
{
"pin" : {
"location" : {
"lat" : 40.12,
"lon" : -71.34
}
}
}
Then the following simple query can be executed with a
geo_bounding_box filter:
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_bounding_box" : {
"pin.location" : {
"top_left" : {
"lat" : 40.73,
"lon" : -74.1
},
"bottom_right" : {
"lat" : 40.01,
"lon" : -71.12
}
}
}
}
}
}
Query Options
| Option | Description |
|---|---|
| _name | Optional name field to identify the filter |
| ignore_malformed | Set to true to accept geo points with invalid latitude or longitude (default is false). |
| type | Set to one of indexed or memory to define whether this filter will be executed in memory or indexed. See Type below for further details. Default is memory. |
Accepted Formats
In much the same way the geo_point type can accept different representations of the geo point, the filter can accept them as well:
Lat Lon As Properties
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_bounding_box" : {
"pin.location" : {
"top_left" : {
"lat" : 40.73,
"lon" : -74.1
},
"bottom_right" : {
"lat" : 40.01,
"lon" : -71.12
}
}
}
}
}
}
Lat Lon As Array
Format in [lon, lat]. Note the order of lon/lat here, in order to
conform with GeoJSON.
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_bounding_box" : {
"pin.location" : {
"top_left" : [-74.1, 40.73],
"bottom_right" : [-71.12, 40.01]
}
}
}
}
}
Lat Lon As String
Format in lat,lon.
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_bounding_box" : {
"pin.location" : {
"top_left" : "40.73, -74.1",
"bottom_right" : "40.01, -71.12"
}
}
}
}
}
Geohash
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_bounding_box" : {
"pin.location" : {
"top_left" : "dr5r9ydj2y73",
"bottom_right" : "drj7teegpus6"
}
}
}
}
}
Vertices
The vertices of the bounding box can either be set by top_left and
bottom_right or by top_right and bottom_left parameters. More
over the names topLeft, bottomRight, topRight and bottomLeft
are supported. Instead of setting the values pairwise, one can use
the simple names top, left, bottom and right to set the
values separately.
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_bounding_box" : {
"pin.location" : {
"top" : -74.1,
"left" : 40.73,
"bottom" : -71.12,
"right" : 40.01
}
}
}
}
}
geo_point Type
The filter requires the geo_point type to be set on the relevant
field.
Multi Location Per Document
The filter can work with multiple locations / points per document. Once a single location / point matches the filter, the document will be included in the results.
Type
The type of bounding box execution is set to memory by default,
which means the check of whether a doc falls within the bounding box
range is done in memory. In some cases, the indexed option will perform
faster (but note that the geo_point type must have lat and lon indexed
in this case). Note that when using the indexed option, multiple locations
per document are not supported. Here is an example:
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_bounding_box" : {
"pin.location" : {
"top_left" : {
"lat" : 40.73,
"lon" : -74.1
},
"bottom_right" : {
"lat" : 40.10,
"lon" : -71.12
}
},
"type" : "indexed"
}
}
}
}
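Conceptually, the memory execution mode performs a check like the following for each document's point (a sketch that ignores boxes crossing the date line):

```python
# In-memory bounding box check: a point matches when its latitude lies
# between the bottom and top edges and its longitude between the left
# and right edges of the box.
def in_bounding_box(lat, lon, top_left, bottom_right):
    return (bottom_right["lat"] <= lat <= top_left["lat"] and
            top_left["lon"] <= lon <= bottom_right["lon"])
```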
113.3. Geo Distance Query
Filters documents to include only hits that exist within a specific distance from a geo point. Assuming the following indexed JSON:
{
"pin" : {
"location" : {
"lat" : 40.12,
"lon" : -71.34
}
}
}
Then the following simple query can be executed with a geo_distance
filter:
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "200km",
"pin.location" : {
"lat" : 40,
"lon" : -70
}
}
}
}
}
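To see what such a filter computes, here is a rough sketch of the distance check using the haversine great-circle formula on a spherical Earth (an approximation of the arc distance types; an illustration, not the actual implementation):

```python
import math

# Returns True when the point (lat2, lon2) lies within radius_km of the
# central point (lat1, lon1), using the haversine formula.
def within_distance(lat1, lon1, lat2, lon2, radius_km):
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2 +
         math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    distance_km = 2 * earth_radius_km * math.asin(math.sqrt(a))
    return distance_km <= radius_km
```

With the document above, the pin at (40.12, -71.34) is roughly 115 km from the query point (40, -70), so the 200km filter matches it.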
Accepted Formats
In much the same way the geo_point type can accept different
representations of the geo point, the filter can accept them as well:
Lat Lon As Properties
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "12km",
"pin.location" : {
"lat" : 40,
"lon" : -70
}
}
}
}
}
Lat Lon As Array
Format in [lon, lat]. Note the order of lon/lat here, in order to
conform with GeoJSON.
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "12km",
"pin.location" : [-70, 40]
}
}
}
}
Lat Lon As String
Format in lat,lon.
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "12km",
"pin.location" : "40,-70"
}
}
}
}
Geohash
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_distance" : {
"distance" : "12km",
"pin.location" : "drm3btev3e86"
}
}
}
}
Options
The following are the options allowed on the filter:
| Option | Description |
|---|---|
| distance | The radius of the circle centred on the specified location. Points which fall into this circle are considered to be matches. The distance can be specified in various units. See Distance Units. |
| distance_type | How to compute the distance. Can either be sloppy_arc (default), arc (slightly more precise but significantly slower) or plane (faster, but inaccurate on long distances and close to the poles). |
| optimize_bbox | Whether to use the optimization of first running a bounding box check before the distance check. Defaults to memory, which performs in-memory checks. Can also be set to indexed to use indexed value checks (make sure the geo_point type indexes lat and lon in this case), or none to disable bounding box optimization. |
| _name | Optional name field to identify the query |
| ignore_malformed | Set to true to accept geo points with invalid latitude or longitude (default is false). |
geo_point Type
The filter requires the geo_point type to be set on the relevant
field.
Multi Location Per Document
The geo_distance filter can work with multiple locations / points per
document. Once a single location / point matches the filter, the
document will be included in the results.
113.4. Geo Distance Range Query
Filters documents that exist within a range of distances from a specific point:
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_distance_range" : {
"from" : "200km",
"to" : "400km",
"pin.location" : {
"lat" : 40,
"lon" : -70
}
}
}
}
}
Supports the same point location parameter and query options as the geo_distance filter. It also supports the common parameters for range (lt, lte, gt, gte, from, to, include_upper and include_lower).
113.5. Geo Polygon Query
A query allowing you to include only hits that fall within a polygon of points. Here is an example:
{
"bool" : {
"query" : {
"match_all" : {}
},
"filter" : {
"geo_polygon" : {
"person.location" : {
"points" : [
{"lat" : 40, "lon" : -70},
{"lat" : 30, "lon" : -80},
{"lat" : 20, "lon" : -90}
]
}
}
}
}
}
Query Options
| Option | Description |
|---|---|
| _name | Optional name field to identify the filter |
| ignore_malformed | Set to true to accept geo points with invalid latitude or longitude (default is false). |
Allowed Formats
Lat Lon As Array
Format in [lon, lat]. Note the order of lon/lat here, in order to
conform with GeoJSON.
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_polygon" : {
"person.location" : {
"points" : [
[-70, 40],
[-80, 30],
[-90, 20]
]
}
}
}
}
}
Lat Lon as String
Format in lat,lon.
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_polygon" : {
"person.location" : {
"points" : [
"40, -70",
"30, -80",
"20, -90"
]
}
}
}
}
}
Geohash
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geo_polygon" : {
"person.location" : {
"points" : [
"drn5x1g8cu2y",
"30, -80",
"20, -90"
]
}
}
}
}
}
geo_point Type
The query requires the geo_point type to be set on the
relevant field.
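The decision this query makes per point can be sketched with the standard ray-casting point-in-polygon test (an illustration only, not the actual Lucene implementation):

```python
# Ray casting: shoot a ray east from the point and count how many
# polygon edges it crosses; an odd count means the point is inside.
def in_polygon(lat, lon, points):
    inside = False
    n = len(points)
    for i in range(n):
        a, b = points[i], points[(i + 1) % n]
        if (a["lat"] > lat) != (b["lat"] > lat):
            crossing_lon = (b["lon"] - a["lon"]) * (lat - a["lat"]) / \
                           (b["lat"] - a["lat"]) + a["lon"]
            if lon < crossing_lon:
                inside = not inside
    return inside
```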
113.6. Geohash Cell Query
The geohash_cell query provides access to a hierarchy of geohashes.
By defining a geohash cell, only geopoints
within this cell will match this filter.
To get this filter to work, all prefixes of a geohash need to be indexed.
For example, a geohash u30 needs to be decomposed into three terms: u30,
u3 and u. This decomposition must be enabled in the mapping of the
geopoint field that's going to be filtered, by
setting the geohash_prefix option:
{
"mappings" : {
"location": {
"properties": {
"pin": {
"type": "geo_point",
"geohash": true,
"geohash_prefix": true,
"geohash_precision": 10
}
}
}
}
}
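The decomposition described above is simply the set of prefixes of the geohash; a sketch:

```python
# Decompose a geohash into itself plus all of its shorter prefixes,
# which is what geohash_prefix indexes as separate terms.
def geohash_prefixes(geohash):
    return [geohash[:i] for i in range(len(geohash), 0, -1)]
```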
The geohash cell can be defined by all formats of geo_points. If such a cell
is defined by a latitude and longitude pair, the size of the cell needs to be
set up. This can be done with the precision parameter of the filter. This
parameter can be set to an integer value which sets the length of the geohash
prefix. Instead of setting a geohash length directly, it is also possible to
define the precision as a distance, for example "precision": "50m". (See
Distance Units.)
The neighbors option of the filter offers the possibility to also filter
cells next to the given cell.
{
"bool" : {
"must" : {
"match_all" : {}
},
"filter" : {
"geohash_cell": {
"pin": {
"lat": 13.4080,
"lon": 52.5186
},
"precision": 3,
"neighbors": true
}
}
}
}
114. Specialized queries
This group contains queries which do not fit into the other groups:
more_like_this query
This query finds documents which are similar to the specified text, document, or collection of documents.
template query
The template query accepts a Mustache template (either inline, indexed, or from a file), and a map of parameters, and combines the two to generate the final query to execute.
script query
This query allows a script to act as a filter. Also see the function_score query.
114.1. More Like This Query
The More Like This Query (MLT Query) finds documents that are "like" a given
set of documents. In order to do so, MLT selects a set of representative terms
of these input documents, forms a query using these terms, executes the query
and returns the results. The user controls the input documents, how the terms
should be selected and how the query is formed. more_like_this can be
shortened to mlt.
The simplest use case consists of asking for documents that are similar to a provided piece of text. Here, we are asking for all movies that have some text similar to "Once upon a time" in their "title" and in their "description" fields, limiting the number of selected terms to 12.
{
"more_like_this" : {
"fields" : ["title", "description"],
"like" : "Once upon a time",
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
A more complicated use case consists of mixing texts with documents already existing in the index. In this case, the syntax to specify a document is similar to the one used in the Multi GET API.
{
"more_like_this" : {
"fields" : ["title", "description"],
"like" : [
{
"_index" : "imdb",
"_type" : "movies",
"_id" : "1"
},
{
"_index" : "imdb",
"_type" : "movies",
"_id" : "2"
},
"and potentially some more text here as well"
],
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
Finally, users can mix some texts and a chosen set of documents, and also provide documents not necessarily present in the index. To provide documents not present in the index, the syntax is similar to artificial documents.
{
"more_like_this" : {
"fields" : ["name.first", "name.last"],
"like" : [
{
"_index" : "marvel",
"_type" : "quotes",
"doc" : {
"name": {
"first": "Ben",
"last": "Grimm"
},
"tweet": "You got no idea what I'd... what I'd give to be invisible."
}
},
{
"_index" : "marvel",
"_type" : "quotes",
"_id" : "2"
}
],
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
114.1.1. How it Works
Suppose we wanted to find all documents similar to a given input document.
Obviously, the input document itself should be its best match for that type of
query. And the reason would be mostly, according to
Lucene scoring formula,
due to the terms with the highest tf-idf. Therefore, the terms of the input
document that have the highest tf-idf are good representatives of that
document, and could be used within a disjunctive query (or OR) to retrieve similar
documents. The MLT query simply extracts the text from the input document,
analyzes it, usually using the same analyzer as the field, then selects the
top K terms with the highest tf-idf to form a disjunctive query of these terms.
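A toy sketch of that term-selection step, scoring each term of the input text by tf-idf against a tiny in-memory corpus and keeping the top max_query_terms (an illustration only; the real implementation works on Lucene's index statistics and analyzers):

```python
import math
from collections import Counter

# Select the top-k input terms by tf-idf; these would form the
# disjunctive (OR) query used to retrieve similar documents.
def top_terms(text, corpus, max_query_terms):
    tf = Counter(text.lower().split())
    n = len(corpus)
    def idf(term):
        df = sum(1 for doc in corpus if term in doc.lower().split())
        return math.log((n + 1) / (df + 1)) + 1
    ranked = sorted(tf, key=lambda t: tf[t] * idf(t), reverse=True)
    return ranked[:max_query_terms]
```

Common terms ("the" here) score low because they occur in every document, so rarer, more representative terms win the top slots.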
The fields on which to perform MLT must be indexed and of type
string. Additionally, when using like with documents, either _source
must be enabled or the fields must be stored or store term_vector. In
order to speed up analysis, it could help to store term vectors at index time.
For example, if we wish to perform MLT on the "title" and "tags.raw" fields,
we can explicitly store their term_vector at index time. We can still
perform MLT on the "description" and "tags" fields, as _source is enabled by
default, but there will be no speed up on analysis for these fields.
curl -s -XPUT 'http://localhost:9200/imdb/' -d '{
"mappings": {
"movies": {
"properties": {
"title": {
"type": "string",
"term_vector": "yes"
},
"description": {
"type": "string"
},
"tags": {
"type": "string",
"fields" : {
"raw": {
"type" : "string",
"index" : "not_analyzed",
"term_vector" : "yes"
}
}
}
}
}
}
}'
114.1.2. Parameters
The only required parameter is like; all other parameters have sensible
defaults. There are three types of parameters: one to specify the document
input, one for term selection, and one for query formation.
Document Input Parameters
| Parameter | Description |
|---|---|
| like | The only required parameter of the MLT query is like and follows a versatile syntax, in which the user can specify free form text and/or a single or multiple documents (see examples above). The syntax to specify documents is similar to the one used by the Multi GET API. |
| unlike | The unlike parameter is used in conjunction with like in order not to select terms found in a chosen set of documents. In other words, we could ask for documents like "Apple", but unlike "cake crumble tree". The syntax is the same as like. |
| fields | A list of fields to fetch and analyze the text from. Defaults to the _all field for free text and to all possible fields for document inputs. |
| like_text | deprecated[2.0.0-beta1, Replaced by like] |
| ids or docs | deprecated[2.0.0-beta1, Replaced by like] |
Term Selection Parameters
| Parameter | Description |
|---|---|
| max_query_terms | The maximum number of query terms that will be selected. Increasing this value gives greater accuracy at the expense of query execution speed. Defaults to 25. |
| min_term_freq | The minimum term frequency below which the terms will be ignored from the input document. Defaults to 2. |
| min_doc_freq | The minimum document frequency below which the terms will be ignored from the input document. Defaults to 5. |
| max_doc_freq | The maximum document frequency above which the terms will be ignored from the input document. This could be useful in order to ignore highly frequent words such as stop words. Defaults to unbounded (0). |
| min_word_length | The minimum word length below which the terms will be ignored. The old name min_word_len is deprecated. Defaults to 0. |
| max_word_length | The maximum word length above which the terms will be ignored. The old name max_word_len is deprecated. Defaults to unbounded (0). |
| stop_words | An array of stop words. Any word in this set is considered "uninteresting" and ignored. If the analyzer allows for stop words, you might want to tell MLT to explicitly ignore them, as for the purposes of document similarity it seems reasonable to assume that "a stop word is never interesting". |
| analyzer | The analyzer that is used to analyze the free form text. Defaults to the analyzer associated with the first field in fields. |
Query Formation Parameters
| Parameter | Description |
|---|---|
| minimum_should_match | After the disjunctive query has been formed, this parameter controls the number of terms that must match. The syntax is the same as the minimum should match. (Defaults to "30%".) |
| boost_terms | Each term in the formed query could be further boosted by their tf-idf score. This sets the boost factor to use when using this feature. Defaults to deactivated (0). Any other positive value activates terms boosting with the given boost factor. |
| include | Specifies whether the input documents should also be included in the search results returned. Defaults to false. |
| boost | Sets the boost value of the whole query. Defaults to 1.0. |
114.2. Template Query
A query that accepts a query template and a map of key/value pairs to fill in template parameters. Templating is based on Mustache. For simple token substitution all you provide is a query containing some variable that you want to substitute and the actual values:
GET /_search
{
"query": {
"template": {
"inline": { "match": { "text": "{{query_string}}" }},
"params" : {
"query_string" : "all about search"
}
}
}
}
The above request is translated into:
GET /_search
{
"query": {
"match": {
"text": "all about search"
}
}
}
Alternatively passing the template as an escaped string works as well:
GET /_search
{
"query": {
"template": {
"inline": "{ \"match\": { \"text\": \"{{query_string}}\" }}",
"params" : {
"query_string" : "all about search"
}
}
}
}
New line characters (\n) should be escaped as \\n or removed,
and quotes (") should be escaped as \\".
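The effect of rendering the template can be sketched as a plain string substitution (real templating uses Mustache; this toy version handles only bare {{var}} tokens):

```python
import json
import re

# Replace each {{name}} token with the corresponding parameter value,
# producing the final query body.
def render_template(template, params):
    return re.sub(r"\{\{(\w+)\}\}", lambda m: params[m.group(1)], template)
```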
114.2.1. Stored templates
You can register a template by storing it in the config/scripts directory, in a file using the .mustache extension.
In order to execute the stored template, reference it by name in the file
parameter:
GET /_search
{
"query": {
"template": {
"file": "my_template",
"params" : {
"query_string" : "all about search"
}
}
}
}
Name of the query template in config/scripts/, i.e., my_template.mustache.
Alternatively, you can register a query template in the special .scripts index with:
PUT /_search/template/my_template
{
"template": { "match": { "text": "{{query_string}}" }}
}
and refer to it in the template query with the id parameter:
GET /_search
{
"query": {
"template": {
"id": "my_template",
"params" : {
"query_string" : "all about search"
}
}
}
}
ID of the query template stored in the .scripts index, i.e., my_template.
There is also a dedicated template endpoint which allows you to template an entire search request.
Please see Search Template for more details.
114.3. Script Query
A query allowing scripts to be defined as queries. They are typically used in a filter context, for example:
"bool" : {
"must" : {
...
},
"filter" : {
"script" : {
"script" : "doc['num1'].value > 1"
}
}
}
Custom Parameters
Scripts are compiled and cached for faster execution. If the same script can be reused, just with different parameters provided, it is preferable to use the ability to pass parameters to the script itself, for example:
"bool" : {
"must" : {
...
},
"filter" : {
"script" : {
"script" : {
"inline" : "doc['num1'].value > param1"
"params" : {
"param1" : 5
}
}
}
}
}
115. Span queries
Span queries are low-level positional queries which provide expert control over the order and proximity of the specified terms. These are typically used to implement very specific queries on legal documents or patents.
Span queries cannot be mixed with non-span queries (with the exception of the span_multi query).
The queries in this group are:
span_term query
The equivalent of the term query but for use with other span queries.
span_multi query
Wraps a term, range, prefix, wildcard, regexp, or fuzzy query.
span_first query
Accepts another span query whose matches must appear within the first N positions of the field.
span_near query
Accepts multiple span queries whose matches must be within the specified distance of each other, and possibly in the same order.
span_or query
Combines multiple span queries and returns documents which match any of the specified queries.
span_not query
Wraps another span query, and excludes any documents which match that query.
span_containing query
Accepts a list of span queries, but only returns those spans which also match a second span query.
span_within query
The result from a single span query is returned as long as its span falls within the spans returned by a list of other span queries.
115.1. Span Term Query
Matches spans containing a term. The span term query maps to Lucene
SpanTermQuery. Here is an example:
{
"span_term" : { "user" : "kimchy" }
}
A boost can also be associated with the query:
{
"span_term" : { "user" : { "value" : "kimchy", "boost" : 2.0 } }
}
Or:
{
"span_term" : { "user" : { "term" : "kimchy", "boost" : 2.0 } }
}
115.2. Span Multi Term Query
The span_multi query allows you to wrap a multi term query (one of wildcard,
fuzzy, prefix, term, range or regexp query) as a span query, so
it can be nested. Example:
{
"span_multi":{
"match":{
"prefix" : { "user" : { "value" : "ki" } }
}
}
}
A boost can also be associated with the query:
{
"span_multi":{
"match":{
"prefix" : { "user" : { "value" : "ki", "boost" : 1.08 } }
}
}
}
115.3. Span First Query
Matches spans near the beginning of a field. The span first query maps
to Lucene SpanFirstQuery. Here is an example:
{
"span_first" : {
"match" : {
"span_term" : { "user" : "kimchy" }
},
"end" : 3
}
}
The match clause can be any other span type query. The end controls
the maximum end position permitted in a match.
115.4. Span Near Query
Matches spans which are near one another. One can specify slop, the
maximum number of intervening unmatched positions, as well as whether
matches are required to be in-order. The span near query maps to Lucene
SpanNearQuery. Here is an example:
{
"span_near" : {
"clauses" : [
{ "span_term" : { "field" : "value1" } },
{ "span_term" : { "field" : "value2" } },
{ "span_term" : { "field" : "value3" } }
],
"slop" : 12,
"in_order" : false,
"collect_payloads" : false
}
}
The clauses element is a list of one or more other span type queries
and the slop controls the maximum number of intervening unmatched
positions permitted.
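A simplified model of the matching rule, given one match position per clause (the real Lucene SpanNearQuery works on spans rather than single positions; this is an illustration only):

```python
# The clause matches form a window; the number of intervening unmatched
# positions inside that window must not exceed slop, and with
# in_order=True the clause positions must appear in clause order.
def span_near_matches(positions, slop, in_order):
    if in_order and positions != sorted(positions):
        return False
    window = max(positions) - min(positions) + 1
    intervening = window - len(positions)
    return intervening <= slop
```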
115.5. Span Or Query
Matches the union of its span clauses. The span or query maps to Lucene
SpanOrQuery. Here is an example:
{
"span_or" : {
"clauses" : [
{ "span_term" : { "field" : "value1" } },
{ "span_term" : { "field" : "value2" } },
{ "span_term" : { "field" : "value3" } }
]
}
}
The clauses element is a list of one or more other span type queries.
115.6. Span Not Query
Removes matches which overlap with another span query. The span not
query maps to Lucene SpanNotQuery. Here is an example:
{
"span_not" : {
"include" : {
"span_term" : { "field1" : "hoya" }
},
"exclude" : {
"span_near" : {
"clauses" : [
{ "span_term" : { "field1" : "la" } },
{ "span_term" : { "field1" : "hoya" } }
],
"slop" : 0,
"in_order" : true
}
}
}
}
The include and exclude clauses can be any span type query. The
include clause is the span query whose matches are filtered, and the
exclude clause is the span query whose matches must not overlap those
returned.
In the above example all documents with the term hoya are filtered except the ones that have la preceding them.
Other top level options:
| Option | Description |
|---|---|
| pre | If set, the amount of tokens before the include span that can't have overlap with the exclude span. |
| post | If set, the amount of tokens after the include span that can't have overlap with the exclude span. |
| dist | If set, the amount of tokens from within the include span that can't have overlap with the exclude span. Equivalent to setting both pre and post. |
115.7. Span Containing Query
Returns matches which enclose another span query. The span containing
query maps to Lucene SpanContainingQuery. Here is an example:
{
"span_containing" : {
"little" : {
"span_term" : { "field1" : "foo" }
},
"big" : {
"span_near" : {
"clauses" : [
{ "span_term" : { "field1" : "bar" } },
{ "span_term" : { "field1" : "baz" } }
],
"slop" : 5,
"in_order" : true
}
}
}
}
The big and little clauses can be any span type query. Matching
spans from big that contain matches from little are returned.
115.8. Span Within Query
Returns matches which are enclosed inside another span query. The span within
query maps to Lucene SpanWithinQuery. Here is an example:
{
"span_within" : {
"little" : {
"span_term" : { "field1" : "foo" }
},
"big" : {
"span_near" : {
"clauses" : [
{ "span_term" : { "field1" : "bar" } },
{ "span_term" : { "field1" : "baz" } }
],
"slop" : 5,
"in_order" : true
}
}
}
}
The big and little clauses can be any span type query. Matching
spans from little that are enclosed within big are returned.
116. Minimum Should Match
The possible values for the minimum_should_match parameter:
| Type | Example | Description |
|---|---|---|
| Integer | 3 | Indicates a fixed value regardless of the number of optional clauses. |
| Negative integer | -2 | Indicates that the total number of optional clauses, minus this number, should be mandatory. |
| Percentage | 75% | Indicates that this percent of the total number of optional clauses are necessary. The number computed from the percentage is rounded down and used as the minimum. |
| Negative percentage | -25% | Indicates that this percent of the total number of optional clauses can be missing. The number computed from the percentage is rounded down, before being subtracted from the total to determine the minimum. |
| Combination | 3<90% | A positive integer, followed by the less-than symbol, followed by any of the previously mentioned specifiers is a conditional specification. It indicates that if the number of optional clauses is equal to (or less than) the integer, they are all required, but if it's greater than the integer, the specification applies. In this example: if there are 1 to 3 clauses they are all required, but for 4 or more clauses only 90% are required. |
| Multiple combinations | 2<-25% 9<-3 | Multiple conditional specifications can be separated by spaces, each one only being valid for numbers greater than the one before it. In this example: if there are 1 or 2 clauses both are required, if there are 3-9 clauses all but 25% are required, and if there are more than 9 clauses, all but three are required. |
NOTE:
When dealing with percentages, negative values can be used to get different behavior in edge cases. 75% and -25% mean the same thing when dealing with 4 clauses, but when dealing with 5 clauses 75% means 3 are required, but -25% means 4 are required.
If the calculations based on the specification determine that no optional clauses are needed, the usual rules about BooleanQueries still apply at search time (a BooleanQuery containing no required clauses must still match at least one optional clause).
No matter what number the calculation arrives at, a value greater than the number of optional clauses, or a value less than 1, will never be used. (I.e. no matter how low or how high the result of the calculation is, the minimum number of required matches will never be lower than 1 or greater than the number of clauses.)
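The rules above can be captured in a small calculator (a sketch consistent with the table and notes, not the actual Lucene parser):

```python
# Compute the number of optional clauses required to match for a given
# minimum_should_match specification and optional-clause count.
def min_should_match(spec, optional_clauses):
    spec = spec.strip()
    if "<" in spec:
        # Conditional specs: apply the last condition that is satisfied;
        # at or below the lowest condition, all clauses are required.
        required = optional_clauses
        for part in spec.split():
            cond, sub = part.split("<", 1)
            if optional_clauses > int(cond):
                required = min_should_match(sub, optional_clauses)
        return required
    if spec.endswith("%"):
        pct = int(spec[:-1])
        if pct < 0:
            # rounded down, then subtracted from the total
            required = optional_clauses - (optional_clauses * -pct) // 100
        else:
            # rounded down and used as the minimum
            required = (optional_clauses * pct) // 100
    else:
        n = int(spec)
        required = n if n >= 0 else optional_clauses + n
    # never lower than 1, never greater than the number of clauses
    return max(1, min(required, optional_clauses))
```

With 5 clauses this reproduces the edge case from the note: "75%" requires 3 but "-25%" requires 4.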
117. Multi Term Query Rewrite
Multi term queries, like
wildcard and
prefix, expand into many terms and go through a process of rewrite. This
also happens with the
query_string query.
All of these queries allow you to control how they are rewritten, using
the rewrite parameter:
-
constant_score (default): A rewrite method that performs like constant_score_boolean when there are few matching terms and otherwise visits all matching terms in sequence and marks documents for that term. Matching documents are assigned a constant score equal to the query’s boost. -
scoring_boolean: A rewrite method that first translates each term into a should clause in a boolean query, and keeps the scores as computed by the query. Note that typically such scores are meaningless to the user, and require non-trivial CPU to compute, so it’s almost always better to use constant_score_auto. This rewrite method will trigger a too many clauses failure if it exceeds the boolean query limit (defaults to 1024). -
constant_score_boolean: Similar to scoring_boolean except scores are not computed. Instead, each matching document receives a constant score equal to the query’s boost. This rewrite method will trigger a too many clauses failure if it exceeds the boolean query limit (defaults to 1024). -
top_terms_N: A rewrite method that first translates each term into a should clause in a boolean query, and keeps the scores as computed by the query. This rewrite method only uses the top scoring terms, so it will not overflow the boolean max clause count. The N controls the size of the top scoring terms to use. -
top_terms_boost_N: A rewrite method that first translates each term into a should clause in a boolean query, but the scores are only computed as the boost. This rewrite method only uses the top scoring terms, so it will not overflow the boolean max clause count. The N controls the size of the top scoring terms to use. -
top_terms_blended_freqs_N: A rewrite method that first translates each term into a should clause in a boolean query, but all term queries compute scores as if they had the same frequency. In practice, the frequency used is the maximum frequency of all matching terms. This rewrite method only uses the top scoring terms, so it will not overflow the boolean max clause count. The N controls the size of the top scoring terms to use.
Mapping
Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. For instance, use mappings to define:
-
which string fields should be treated as full text fields.
-
which fields contain numbers, dates, or geolocations.
-
whether the values of all fields in the document should be indexed into the catch-all
_all field. -
the format of date values.
-
custom rules to control the mapping for dynamically added fields.
Mapping Types
Each index has one or more mapping types, which are used to divide the
documents in an index into logical groups. User documents might be stored in a
user type, and blog posts in a blogpost type.
Each mapping type has:
- Meta-fields
-
Meta-fields are used to customize how a document’s associated metadata is treated. Examples of meta-fields include the document’s
_index, _type, _id, and _source fields. - Fields or properties
-
Each mapping type contains a list of fields or
properties pertinent to that type. A user type might contain title, name, and age fields, while a blogpost type might contain title, body, user_id and created fields. Fields with the same name in different mapping types in the same index must have the same mapping.
Field datatypes
Each field has a data type, which can be a simple type like string, date, long, or boolean; a type which supports the hierarchical nature of JSON such as object or nested; or a specialised type like geo_point, geo_shape, or completion.
It is often useful to index the same field in different ways for different
purposes. For instance, a string field could be indexed as
an analyzed field for full-text search, and as a not_analyzed field for
sorting or aggregations. Alternatively, you could index a string field with
the standard analyzer, the
english analyzer, and the
french analyzer.
This is the purpose of multi-fields. Most datatypes support multi-fields
via the fields parameter.
Dynamic mapping
Fields and mapping types do not need to be defined before being used. Thanks
to dynamic mapping, new mapping types and new field names will be added
automatically, just by indexing a document. New fields can be added both to
the top-level mapping type, and to inner object and
nested fields.
The dynamic mapping rules can be configured to customise the mapping that is used for new types and new fields.
Explicit mappings
You know more about your data than Elasticsearch can guess, so while dynamic mapping can be useful to get started, at some point you will want to specify your own explicit mappings.
You can create mapping types and field mappings when you create an index, and you can add mapping types and fields to an existing index with the PUT mapping API.
Updating existing mappings
Other than where documented, existing type and field mappings cannot be updated. Changing the mapping would mean invalidating already indexed documents. Instead, you should create a new index with the correct mappings and reindex your data into that index.
Fields are shared across mapping types
Mapping types are used to group fields, but the fields in each mapping type are not independent of each other. Fields with:
-
the same name
-
in the same index
-
in different mapping types
-
map to the same field internally,
-
and must have the same mapping.
If a title field exists in both the user and blogpost mapping types, the
title fields must have exactly the same mapping in each type. The only
exceptions to this rule are the copy_to, dynamic, enabled,
ignore_above, include_in_all, and properties parameters, which may
have different settings per field.
Usually, fields with the same name also contain the same type of data, so
having the same mapping is not a problem. When conflicts do arise, these can
be solved by choosing more descriptive names, such as user_title and
blog_title.
Example mapping
A mapping for the example described above could be specified when creating the index, as follows:
PUT my_index
{
"mappings": {
"user": {
"_all": { "enabled": false },
"properties": {
"title": { "type": "string" },
"name": { "type": "string" },
"age": { "type": "integer" }
}
},
"blogpost": {
"properties": {
"title": { "type": "string" },
"body": { "type": "string" },
"user_id": {
"type": "string",
"index": "not_analyzed"
},
"created": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
}
}
}
}
}
Create an index called my_index. |
|
Add mapping types called user and blogpost. |
|
Disable the _all meta field for the user mapping type. |
|
| Specify fields or properties in each mapping type. | |
Specify the data type and mapping for each field. |
118. Field datatypes
Elasticsearch supports a number of different datatypes for the fields in a document:
Core datatypes
- String datatype
-
string - Numeric datatypes
-
long, integer, short, byte, double, float - Date datatype
-
date - Boolean datatype
-
boolean - Binary datatype
-
binary
Complex datatypes
- Array datatype
-
Array support does not require a dedicated
type - Object datatype
-
object for single JSON objects - Nested datatype
-
nested for arrays of JSON objects
Geo datatypes
- Geo-point datatype
-
geo_point for lat/lon points - Geo-Shape datatype
-
geo_shape for complex shapes like polygons
Specialised datatypes
- IPv4 datatype
-
ip for IPv4 addresses - Completion datatype
-
completion to provide auto-complete suggestions - Token count datatype
-
token_count to count the number of tokens in a string - mapper-murmur3
-
murmur3 to compute hashes of values at index-time and store them in the index - Attachment datatype
-
See the
mapper-attachments plugin which supports indexing attachments like Microsoft Office formats, Open Document formats, ePub, HTML, etc. into an attachment datatype.
Multi-fields
It is often useful to index the same field in different ways for different
purposes. For instance, a string field could be indexed as
an analyzed field for full-text search, and as a not_analyzed field for
sorting or aggregations. Alternatively, you could index a string field with
the standard analyzer, the
english analyzer, and the
french analyzer.
This is the purpose of multi-fields. Most datatypes support multi-fields
via the fields parameter.
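Conceptually, a multi-field indexes one source value in several independent forms. A minimal Python sketch of the idea (the analysis here is deliberately simplified, and the city / city.raw names are hypothetical):

```python
# One source value, indexed in two independent forms -- a simplified
# sketch of what a multi-field does (not Elasticsearch's analysis chain).
def analyzed(value):
    """Full-text form: lowercased terms, good for search."""
    return value.lower().split()

def not_analyzed(value):
    """Exact form: a single term, good for sorting and aggregations."""
    return [value]

city = "New York"
index_forms = {
    "city": analyzed(city),          # ["new", "york"]
    "city.raw": not_analyzed(city),  # ["New York"]
}
print(index_forms)
```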
118.1. Array datatype
In Elasticsearch, there is no dedicated array type. Any field can contain
zero or more values by default, however, all values in the array must be of
the same datatype. For instance:
-
an array of strings: [ "one", "two" ] -
an array of integers: [ 1, 2 ] -
an array of arrays: [ 1, [ 2, 3 ]] which is the equivalent of [ 1, 2, 3 ] -
an array of objects: [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 } ]
|
|
Arrays of objects
Arrays of objects do not work as you would expect: you cannot query each
object independently of the other objects in the array. If you need to be
able to do this then you should use the nested datatype instead of the object datatype. This is explained in more detail in Nested datatype. |
When adding a field dynamically, the first value in the array determines the
field type. All subsequent values must be of the same datatype or it must
at least be possible to coerce subsequent values to the same
datatype.
Arrays with a mixture of datatypes are not supported: [ 10, "some string" ]
An array may contain null values, which are either replaced by the
configured null_value or skipped entirely. An empty array
[] is treated as a missing field — a field with no values.
Nothing needs to be pre-configured in order to use arrays in documents; they are supported out of the box:
PUT my_index/my_type/1
{
"message": "some arrays in this document...",
"tags": [ "elasticsearch", "wow" ],
"lists": [
{
"name": "prog_list",
"description": "programming list"
},
{
"name": "cool_list",
"description": "cool stuff list"
}
]
}
PUT my_index/my_type/2
{
"message": "no arrays in this document...",
"tags": "elasticsearch",
"lists": {
"name": "prog_list",
"description": "programming list"
}
}
GET my_index/_search
{
"query": {
"match": {
"tags": "elasticsearch"
}
}
}
The tags field is dynamically added as a string field. |
|
The lists field is dynamically added as an object field. |
|
| The second document contains no arrays, but can be indexed into the same fields. | |
The query looks for elasticsearch in the tags field, and matches both documents. |
118.2. Binary datatype
The binary type accepts a binary value as a
Base64 encoded string. The field is not
stored by default and is not searchable:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"name": {
"type": "string"
},
"blob": {
"type": "binary"
}
}
}
}
}
PUT my_index/my_type/1
{
"name": "Some binary blob",
"blob": "U29tZSBiaW5hcnkgYmxvYg=="
}
The Base64 encoded binary value must not have embedded newlines \n. |
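The Base64 value in the example above is simply the encoded form of the text being indexed; a quick round-trip check with Python's standard library:

```python
import base64

# Encoding the raw bytes yields the value indexed above...
encoded = base64.b64encode(b"Some binary blob").decode("ascii")
print(encoded)  # U29tZSBiaW5hcnkgYmxvYg==

# ...and decoding round-trips to the original bytes.
decoded = base64.b64decode(encoded)
print(decoded)  # b'Some binary blob'
```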
118.2.1. Parameters for binary fields
The following parameters are accepted by binary fields:
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts |
store
|
Whether the field value should be stored and retrievable separately from
the |
118.3. Boolean datatype
Boolean fields accept JSON true and false values, but can also accept
strings and numbers which are interpreted as either true or false:
| False values |
false, "false", "off", "no", "0", "" (empty string), 0, 0.0 |
| True values |
Anything that isn’t false. |
For example:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"is_published": {
"type": "boolean"
}
}
}
}
}
POST my_index/my_type/1
{
"is_published": true
}
GET my_index/_search
{
"query": {
"term": {
"is_published": 1
}
}
}
Indexing a document with a JSON true. |
|
Querying for the document with 1, which is interpreted as true. |
Aggregations like the terms
aggregation use 1 and 0 for the key, and the strings "true" and
"false" for the key_as_string. Boolean fields when used in scripts,
return 1 and 0:
POST my_index/my_type/1
{
"is_published": true
}
POST my_index/my_type/2
{
"is_published": false
}
GET my_index/_search
{
"aggs": {
"publish_state": {
"terms": {
"field": "is_published"
}
}
},
"script_fields": {
"is_published": {
"script": "doc['is_published'].value"
}
}
}
| Inline scripts must be enabled for this example to work. |
118.3.1. Parameters for boolean fields
The following parameters are accepted by boolean fields:
boost
|
Field-level index time boosting. Accepts a floating point number, defaults
to |
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts |
index
|
Should the field be searchable? Accepts |
null_value
|
Accepts any of the true or false values listed above. The value is
substituted for any explicit |
store
|
Whether the field value should be stored and retrievable separately from
the |
118.4. Date datatype
JSON doesn’t have a date datatype, so dates in Elasticsearch can either be:
-
strings containing formatted dates, e.g.
"2015-01-01"or"2015/01/01 12:10:30". -
a long number representing milliseconds-since-the-epoch.
-
an integer representing seconds-since-the-epoch.
Internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch.
Date formats can be customised, but if no format is specified then it uses
the default:
"strict_date_optional_time||epoch_millis"
This means that it will accept dates with optional timestamps, which conform
to the formats supported by strict_date_optional_time
or milliseconds-since-the-epoch.
For instance:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"date": {
"type": "date"
}
}
}
}
}
PUT my_index/my_type/1
{ "date": "2015-01-01" }
PUT my_index/my_type/2
{ "date": "2015-01-01T12:10:30Z" }
PUT my_index/my_type/3
{ "date": 1420070400001 }
GET my_index/_search
{
"sort": { "date": "asc"}
}
The date field uses the default format. |
|
| This document uses a plain date. | |
| This document includes a time. | |
| This document uses milliseconds-since-the-epoch. | |
Note that the sort values that are returned are all in milliseconds-since-the-epoch. |
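The sort values can be checked against the example documents with Python's standard library (an illustration, not Elasticsearch code):

```python
from datetime import datetime, timezone

# "2015-01-01T12:10:30Z" as milliseconds-since-the-epoch:
with_time = datetime(2015, 1, 1, 12, 10, 30, tzinfo=timezone.utc)
millis_with_time = int(with_time.timestamp() * 1000)
print(millis_with_time)  # 1420114230000

# A plain date like "2015-01-01" is midnight UTC of that day:
plain_date = datetime(2015, 1, 1, tzinfo=timezone.utc)
millis_plain = int(plain_date.timestamp() * 1000)
print(millis_plain)  # 1420070400000
```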
118.4.1. Multiple date formats
Multiple formats can be specified, separated by ||.
Each format will be tried in turn until a matching format is found. The first
format will be used to convert the milliseconds-since-the-epoch value back
into a string.
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"date": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
}
}
}
}
118.4.2. Parameters for date fields
The following parameters are accepted by date fields:
boost
|
Field-level index time boosting. Accepts a floating point number, defaults
to |
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts |
format
|
The date format(s) that can be parsed. Defaults to
|
ignore_malformed
|
If |
include_in_all
|
Whether or not the field value should be included in the
|
index
|
Should the field be searchable? Accepts |
null_value
|
Accepts a date value in one of the configured |
precision_step
|
Controls the number of extra terms that are indexed to make
|
store
|
Whether the field value should be stored and retrievable separately from
the |
118.5. Geo-point datatype
Fields of type geo_point accept latitude-longitude pairs, which can be used:
-
to find geo-points within a bounding box, within a certain distance of a central point, within a polygon, or within a geohash cell.
-
to aggregate documents geographically or by distance from a central point.
-
to integrate distance into a document’s relevance score.
-
to sort documents by distance.
|
|
Percolating geo-queries in Elasticsearch 2.2.0 or later
Geo-queries cannot be percolated out of the box in Elasticsearch 2.2.0 or later. See Percolating geo-queries in Elasticsearch 2.2.0 and later for a workaround. |
There are four ways that a geo-point may be specified, as demonstrated below:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
PUT my_index/my_type/1
{
"text": "Geo-point as an object",
"location": {
"lat": 41.12,
"lon": -71.34
}
}
PUT my_index/my_type/2
{
"text": "Geo-point as a string",
"location": "41.12,-71.34"
}
PUT my_index/my_type/3
{
"text": "Geo-point as a geohash",
"location": "drm3btev3e86"
}
PUT my_index/my_type/4
{
"text": "Geo-point as an array",
"location": [ -71.34, 41.12 ]
}
GET my_index/_search
{
"query": {
"geo_bounding_box": {
"location": {
"top_left": {
"lat": 42,
"lon": -72
},
"bottom_right": {
"lat": 40,
"lon": -74
}
}
}
}
}
Geo-point expressed as an object, with lat and lon keys. |
|
Geo-point expressed as a string with the format: "lat,lon". |
|
| Geo-point expressed as a geohash. | |
Geo-point expressed as an array with the format: [ lon, lat] |
|
| A geo-bounding box query which finds all geo-points that fall inside the box. |
|
|
Geo-points expressed as an array or string
Please note that string geo-points are ordered as lat,lon, while array geo-points are ordered as the reverse: lon,lat. Originally, lat,lon was used for both formats, but the array format was changed early on to conform to the format used by GeoJSON. |
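The three literal formats and their differing orderings can be sketched in Python (an illustration, not Elasticsearch's parser):

```python
def to_lat_lon(value):
    """Normalize a geo-point literal to a (lat, lon) tuple."""
    if isinstance(value, str):                  # "lat,lon"
        lat, lon = (float(part) for part in value.split(","))
    elif isinstance(value, (list, tuple)):      # [lon, lat] -- GeoJSON order
        lon, lat = value
    else:                                       # {"lat": ..., "lon": ...}
        lat, lon = value["lat"], value["lon"]
    return lat, lon

# All three spellings of the same point agree:
print(to_lat_lon("41.12,-71.34"))                # (41.12, -71.34)
print(to_lat_lon([-71.34, 41.12]))               # (41.12, -71.34)
print(to_lat_lon({"lat": 41.12, "lon": -71.34})) # (41.12, -71.34)
```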
118.5.1. Parameters for geo_point fields
The following parameters are accepted by geo_point fields:
geohash
|
Should the geo-point also be indexed as a geohash in the |
geohash_precision
|
The maximum length of the geohash to use for the |
geohash_prefix
|
Should the geo-point also be indexed as a geohash plus all its prefixes?
Defaults to |
ignore_malformed
|
If |
lat_lon
|
Should the geo-point also be indexed as |
precision_step
|
Controls the number of extra terms that are indexed for each lat/lon point.
Defaults to |
118.5.2. Using geo-points in scripts
When accessing the value of a geo-point in a script, the value is returned as
a GeoPoint object, which allows access to the .lat and .lon values
respectively:
geopoint = doc['location'].value;
lat = geopoint.lat;
lon = geopoint.lon;
For performance reasons, it is better to access the lat/lon values directly:
lat = doc['location'].lat;
lon = doc['location'].lon;
118.6. Geo-Shape datatype
The geo_shape datatype facilitates the indexing of and searching
with arbitrary geo shapes such as rectangles and polygons. It should be
used when either the data being indexed or the queries being executed
contain shapes other than just points.
You can query documents using this type using geo_shape Query.
Mapping Options
The geo_shape mapping maps GeoJSON geometry objects to the geo_shape type. To enable it, users must explicitly map fields to the geo_shape type.
| Option | Description | Default |
|---|---|---|
|
Name of the PrefixTree implementation to be used: |
|
|
This parameter may be used instead of |
|
|
Maximum number of layers to be used by the PrefixTree.
This can be used to control the precision of shape representations and
therefore how many terms are indexed. Defaults to the default value of
the chosen PrefixTree implementation. Since this parameter requires a
certain level of understanding of the underlying implementation, users
may use the |
|
|
The strategy parameter defines the approach for how to
represent shapes at indexing and search time. It also influences the
capabilities available so it is recommended to let Elasticsearch set
this parameter automatically. There are two strategies available:
|
|
|
Used as a hint to the PrefixTree about how
precise it should be. Defaults to 0.025 (2.5%) with 0.5 as the maximum
supported value. PERFORMANCE NOTE: This value will default to 0 if a |
|
|
Optionally define how to interpret vertex order for
polygons / multipolygons. This parameter defines one of two coordinate
system rules (Right-hand or Left-hand) each of which can be specified in three
different ways. 1. Right-hand rule: |
|
|
Setting this option to |
|
Prefix trees
To efficiently represent shapes in the index, shapes are converted into a series of hashes representing grid squares (commonly referred to as "rasters") using implementations of a PrefixTree. The tree notion comes from the fact that the PrefixTree uses multiple grid layers, each with an increasing level of precision to represent the Earth. This can be thought of as increasing the level of detail of a map or image at higher zoom levels.
Multiple PrefixTree implementations are provided:
-
GeohashPrefixTree - Uses geohashes for grid squares. Geohashes are base32 encoded strings of the bits of the latitude and longitude interleaved. So the longer the hash, the more precise it is. Each character added to the geohash represents another tree level and adds 5 bits of precision to the geohash. A geohash represents a rectangular area and has 32 sub rectangles. The maximum number of levels in Elasticsearch is 24.
-
QuadPrefixTree - Uses a quadtree for grid squares. Similar to geohashes, quad trees interleave the bits of the latitude and longitude, and the resulting hash is a bit set. A tree level in a quad tree represents 2 bits in this bit set, one for each coordinate. The maximum number of levels for the quad trees in Elasticsearch is 50.
Spatial strategies
The PrefixTree implementations rely on a SpatialStrategy for decomposing the provided Shape(s) into approximated grid squares. Each strategy answers the following:
-
What type of Shapes can be indexed?
-
What types of Query Operations and Shapes can be used?
-
Does it support more than one Shape per field?
The following Strategy implementations (with corresponding capabilities) are provided:
| Strategy | Supported Shapes | Supported Queries | Multiple Shapes |
|---|---|---|---|
|
|
Yes |
|
|
|
Yes |
Accuracy
The geo_shape type does not provide 100% accuracy and, depending on how it is configured, may return some false positives or false negatives for certain queries. To mitigate this, it is important to select an appropriate value for the tree_levels parameter and to adjust expectations accordingly. For example, a point may be near the border of a particular grid cell and may thus not match a query that only matches the cell right next to it, even though the shape is very close to the point.
Example
{
"properties": {
"location": {
"type": "geo_shape",
"tree": "quadtree",
"precision": "1m"
}
}
}
This mapping maps the location field to the geo_shape type using the quadtree implementation and a precision of 1m. Elasticsearch translates this into a tree_levels setting of 26.
Performance considerations
Elasticsearch uses the paths in the prefix tree as terms in the index and in queries. The higher the levels (and thus the precision), the more terms are generated. Of course, calculating the terms, keeping them in memory, and storing them on disk all have a price. Especially with higher tree levels, indices can become extremely large even with a modest amount of data. Additionally, the size of the features also matters: big, complex polygons can take up a lot of space at higher tree levels. Which setting is right depends on the use case; generally one trades off accuracy against index size and query performance.
The defaults in Elasticsearch for both implementations are a compromise between index size and a reasonable level of precision of 50m at the equator. This allows for indexing tens of millions of shapes without unduly bloating the resulting index relative to the input size.
Input Structure
| GeoJSON Type | Elasticsearch Type | Description |
|---|---|---|
|
|
A single geographic coordinate. |
|
|
An arbitrary line given two or more points. |
|
|
A closed polygon whose first and last point
must match, thus requiring |
|
|
An array of unconnected, but likely related points. |
|
|
An array of separate linestrings. |
|
|
An array of separate polygons. |
|
|
A GeoJSON shape similar to the
|
|
|
A bounding rectangle, or envelope, specified by specifying only the top left and bottom right points. |
|
|
A circle specified by a center point and radius with
units, which default to |
|
|
For all types, both the inner In GeoJSON, and therefore Elasticsearch, the correct coordinate order is longitude, latitude (X, Y) within coordinate arrays. This differs from many Geospatial APIs (e.g., Google Maps) that generally use the colloquial latitude, longitude (Y, X). |
Point
A point is a single geographic coordinate, such as the location of a building or the current position given by a smartphone’s Geolocation API.
{
"location" : {
"type" : "point",
"coordinates" : [-77.03653, 38.897676]
}
}
LineString
A linestring is defined by an array of two or more positions. By
specifying only two points, the linestring will represent a straight
line. Specifying more than two points creates an arbitrary path.
{
"location" : {
"type" : "linestring",
"coordinates" : [[-77.03653, 38.897676], [-77.009051, 38.889939]]
}
}
The above linestring would draw a straight line from the White
House to the US Capitol Building.
Polygon
A polygon is defined by a list of lists of points. The first and last points in each list must be the same (the polygon must be closed).
{
"location" : {
"type" : "polygon",
"coordinates" : [
[ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ]
]
}
}
The first array represents the outer boundary of the polygon, while the other arrays represent the interior shapes ("holes"):
{
"location" : {
"type" : "polygon",
"coordinates" : [
[ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ],
[ [100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8], [100.2, 0.2] ]
]
}
}
IMPORTANT NOTE: GeoJSON does not mandate a specific order for vertices, so ambiguous polygons around the dateline and poles are possible. To alleviate ambiguity, the Open Geospatial Consortium (OGC) Simple Feature Access specification defines the following vertex ordering:
-
Outer Ring - Counterclockwise
-
Inner Ring(s) / Holes - Clockwise
For polygons that do not cross the dateline, vertex order will not matter in Elasticsearch. For polygons that do cross the dateline, Elasticsearch requires vertex ordering to comply with the OGC specification. Otherwise, an unintended polygon may be created and unexpected query/filter results will be returned.
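The OGC ordering can be checked with a signed-area (shoelace) test; a sketch that ignores dateline wrapping, which is precisely where the ordering rules matter most:

```python
# Shoelace-based orientation test for a closed ring of [lon, lat] points.
# Illustrative only: it does not handle rings crossing the dateline.
def is_counterclockwise(ring):
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        twice_area += x1 * y2 - x2 * y1
    return twice_area > 0

# The outer ring from the polygon example above:
outer = [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]]
print(is_counterclockwise(outer))                  # True  (valid outer ring)
print(is_counterclockwise(list(reversed(outer))))  # False (hole ordering)
```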
The following provides an example of an ambiguous polygon. Elasticsearch will apply OGC standards to eliminate ambiguity resulting in a polygon that crosses the dateline.
{
"location" : {
"type" : "polygon",
"coordinates" : [
[ [-177.0, 10.0], [176.0, 15.0], [172.0, 0.0], [176.0, -15.0], [-177.0, -10.0], [-177.0, 10.0] ],
[ [178.2, 8.2], [-178.8, 8.2], [-180.8, -8.8], [178.2, 8.8] ]
]
}
}
An orientation parameter can be defined when setting the geo_shape mapping (see Mapping Options). This will define vertex
order for the coordinate list on the mapped geo_shape field. It can also be overridden on each document. The following is an example for
overriding the orientation on a document:
{
"location" : {
"type" : "polygon",
"orientation" : "clockwise",
"coordinates" : [
[ [-177.0, 10.0], [176.0, 15.0], [172.0, 0.0], [176.0, -15.0], [-177.0, -10.0], [-177.0, 10.0] ],
[ [178.2, 8.2], [-178.8, 8.2], [-180.8, -8.8], [178.2, 8.8] ]
]
}
}
MultiPoint
A list of geojson points.
{
"location" : {
"type" : "multipoint",
"coordinates" : [
[102.0, 2.0], [103.0, 2.0]
]
}
}
MultiLineString
A list of geojson linestrings.
{
"location" : {
"type" : "multilinestring",
"coordinates" : [
[ [102.0, 2.0], [103.0, 2.0], [103.0, 3.0], [102.0, 3.0] ],
[ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0] ],
[ [100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8] ]
]
}
}
MultiPolygon
A list of geojson polygons.
{
"location" : {
"type" : "multipolygon",
"coordinates" : [
[ [[102.0, 2.0], [103.0, 2.0], [103.0, 3.0], [102.0, 3.0], [102.0, 2.0]] ],
[ [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]],
[[100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8], [100.2, 0.2]] ]
]
}
}
Geometry Collection
A collection of geojson geometry objects.
{
"location" : {
"type": "geometrycollection",
"geometries": [
{
"type": "point",
"coordinates": [100.0, 0.0]
},
{
"type": "linestring",
"coordinates": [ [101.0, 0.0], [102.0, 1.0] ]
}
]
}
}
Envelope
Elasticsearch supports an envelope type, which consists of coordinates
for upper left and lower right points of the shape to represent a
bounding rectangle:
{
"location" : {
"type" : "envelope",
"coordinates" : [ [-45.0, 45.0], [45.0, -45.0] ]
}
}
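A containment check for this envelope representation is straightforward; a hedged sketch (a plain rectangle test, ignoring dateline crossing):

```python
def in_envelope(point, envelope):
    """Check whether a [lon, lat] point falls inside an envelope given as
    [[min_lon, max_lat], [max_lon, min_lat]] (upper left, lower right)."""
    (min_lon, max_lat), (max_lon, min_lat) = envelope
    lon, lat = point
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

env = [[-45.0, 45.0], [45.0, -45.0]]
print(in_envelope([0.0, 0.0], env))   # True
print(in_envelope([50.0, 0.0], env))  # False
```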
Circle
Elasticsearch supports a circle type, which consists of a center
point with a radius:
{
"location" : {
"type" : "circle",
"coordinates" : [-45.0, 45.0],
"radius" : "100m"
}
}
Note: The inner radius field is required. If no units are specified, the radius defaults to METERS.
Sorting and Retrieving Indexed Shapes
Due to the complex input structure and index representation of shapes,
it is not currently possible to sort shapes or retrieve their fields
directly. The geo_shape value is only retrievable through the _source
field.
118.7. IPv4 datatype
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"ip_addr": {
"type": "ip"
}
}
}
}
}
PUT my_index/my_type/1
{
"ip_addr": "192.168.1.1"
}
GET my_index/_search
{
"query": {
"range": {
"ip_addr": {
"gte": "192.168.1.0",
"lt": "192.168.2.0"
}
}
}
}
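The range query above works because IPv4 addresses have a natural numeric order: the ip field is indexed as a number internally. The mapping can be illustrated with Python's ipaddress module (an illustration, not Elasticsearch code):

```python
import ipaddress

def ip_to_long(addr: str) -> int:
    """An IPv4 address as its 32-bit integer value."""
    return int(ipaddress.IPv4Address(addr))

print(ip_to_long("192.168.1.1"))  # 3232235777

# The range bounds from the query above bracket the indexed address:
print(ip_to_long("192.168.1.0") <= ip_to_long("192.168.1.1")
      < ip_to_long("192.168.2.0"))  # True
```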
118.7.1. Parameters for ip fields
The following parameters are accepted by ip fields:
boost
|
Field-level index time boosting. Accepts a floating point number, defaults
to |
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts |
include_in_all
|
Whether or not the field value should be included in the
|
index
|
Should the field be searchable? Accepts |
null_value
|
Accepts an IPv4 value which is substituted for any explicit |
precision_step
|
Controls the number of extra terms that are indexed to make
|
store
|
Whether the field value should be stored and retrievable separately from
the |
|
|
IPv6 addresses are not supported yet. |
118.8. Nested datatype
The nested type is a specialised version of the object datatype
that allows arrays of objects to be indexed and queried independently of each
other.
118.8.1. How arrays of objects are flattened
Arrays of inner object fields do not work the way you may expect.
Lucene has no concept of inner objects, so Elasticsearch flattens object
hierarchies into a simple list of field names and values. For instance, the
following document:
PUT my_index/my_type/1
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
The user field is dynamically added as a field of type object. |
would be transformed internally into a document that looks more like this:
{
"group" : "fans",
"user.first" : [ "alice", "john" ],
"user.last" : [ "smith", "white" ]
}
The user.first and user.last fields are flattened into multi-value fields,
and the association between alice and white is lost. This document would
incorrectly match a query for alice AND smith:
GET my_index/_search
{
"query": {
"bool": {
"must": [
{ "match": { "user.first": "Alice" }},
{ "match": { "user.last": "Smith" }}
]
}
}
}
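The flattening described above can be sketched in Python (an illustration only; real indexing also analyzes the values, which is why the flattened example shows them lowercased):

```python
def flatten(doc, prefix=""):
    """Sketch of how inner objects collapse into dotted field names."""
    out = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        values = value if isinstance(value, list) else [value]
        for v in values:
            if isinstance(v, dict):
                # Recurse, extending the dotted path.
                for inner_key, inner_vals in flatten(v, path + ".").items():
                    out.setdefault(inner_key, []).extend(inner_vals)
            else:
                out.setdefault(path, []).append(v)
    return out

doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}
print(flatten(doc))
# The first/last pairing is lost: "Alice" and "Smith" end up
# in the same multi-value fields.
```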
118.8.2. Using nested fields for arrays of objects
If you need to index arrays of objects and to maintain the independence of
each object in the array, you should use the nested datatype instead of the
object datatype. Internally, nested objects index each object in
the array as a separate hidden document, meaning that each nested object can be
queried independently of the others, with the nested query:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"user": {
"type": "nested"
}
}
}
}
}
PUT my_index/my_type/1
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
GET my_index/_search
{
"query": {
"nested": {
"path": "user",
"query": {
"bool": {
"must": [
{ "match": { "user.first": "Alice" }},
{ "match": { "user.last": "Smith" }}
]
}
}
}
}
}
GET my_index/_search
{
"query": {
"nested": {
"path": "user",
"query": {
"bool": {
"must": [
{ "match": { "user.first": "Alice" }},
{ "match": { "user.last": "White" }}
]
}
},
"inner_hits": {
"highlight": {
"fields": {
"user.first": {}
}
}
}
}
}
The user field is mapped as type nested instead of type object. |
|
This query doesn’t match because Alice and Smith are not in the same nested object. |
|
This query matches because Alice and White are in the same nested object. |
|
inner_hits allow us to highlight the matching nested documents. |
Nested documents can be:
-
queried with the nested query.
-
analyzed with the nested and reverse_nested aggregations.
-
sorted with nested sorting.
-
retrieved and highlighted with nested inner hits.
118.8.3. Parameters for nested fields
The following parameters are accepted by nested fields:
dynamic
|
Whether or not new properties should be added dynamically to an existing
nested object. Accepts true (default), false and strict. |
include_in_all
|
Sets the default include_in_all value for all the properties within
the nested object. Nested documents do not have their own _all field.
Instead, values are added to the _all field of the main "root"
document. |
properties
|
The fields within the nested object, which can be of any
datatype, including nested. New properties may be added to an
existing nested object. |
|
|
Because nested documents are indexed as separate documents, they can only be
accessed within the scope of the nested query, the nested and
reverse_nested aggregations, or nested inner hits. For instance, if a string
field within a nested document has index_options set to offsets to allow
use of the postings highlighter, these offsets will not be available during
the main highlighting phase. Instead, highlighting needs to be performed via
nested inner hits.
|
118.8.4. Limiting the number of nested fields
Indexing a document with 100 nested fields actually indexes 101 documents as each nested
document is indexed as a separate document. To safeguard against ill-defined mappings
the number of nested fields that can be defined per index has been limited to 50. This
default limit can be changed with the index setting index.mapping.nested_fields.limit.
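The arithmetic is simple (illustrative Python): each nested object becomes its own hidden Lucene document, plus one for the root document itself:

```python
# A document whose nested array holds n objects is stored as n + 1
# Lucene documents: one hidden document per nested object, plus the
# root document itself.
def lucene_docs(num_nested_objects):
    return num_nested_objects + 1

print(lucene_docs(100))  # 101
```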
118.9. Numeric datatypes
The following numeric types are supported:
long
|
A signed 64-bit integer with a minimum value of -2^63 and a maximum
value of 2^63-1. |
integer
|
A signed 32-bit integer with a minimum value of -2^31 and a maximum
value of 2^31-1. |
short
|
A signed 16-bit integer with a minimum value of -32,768 and a maximum
value of 32,767. |
byte
|
A signed 8-bit integer with a minimum value of -128 and a maximum
value of 127. |
double
|
A double-precision 64-bit IEEE 754 floating point. |
float
|
A single-precision 32-bit IEEE 754 floating point. |
Below is an example of configuring a mapping with numeric fields:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"number_of_bytes": {
"type": "integer"
},
"time_in_seconds": {
"type": "float"
}
}
}
}
}
118.9.1. Parameters for numeric fields
The following parameters are accepted by numeric types:
coerce
|
Try to convert strings to numbers and truncate fractions for integers.
Accepts true (default) and false. |
boost
|
Field-level index time boosting. Accepts a floating point number, defaults
to 1.0. |
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts true
(default) or false. |
ignore_malformed
|
If true, malformed numbers are ignored. If false (default), malformed
numbers throw an exception and reject the whole document. |
include_in_all
|
Whether or not the field value should be included in the
_all field. Accepts true or false. Defaults to false if index
is set to no, or if a parent object field sets include_in_all to
false. Otherwise defaults to true. |
index
|
Should the field be searchable? Accepts not_analyzed (default) and no. |
null_value
|
Accepts a numeric value of the same type as the field, which is substituted
for any explicit null values. Defaults to null, which means the field
is treated as missing. |
precision_step
|
Controls the number of extra terms that are indexed to make
range queries faster. Defaults to 16. |
store
|
Whether the field value should be stored and retrievable separately from
the _source field. Accepts true or false (default). |
118.10. Object datatype
JSON documents are hierarchical in nature: the document may contain inner objects which, in turn, may contain inner objects themselves:
PUT my_index/my_type/1
{
"region": "US",
"manager": {
"age": 30,
"name": {
"first": "John",
"last": "Smith"
}
}
}
| The outer document is also a JSON object. | |
It contains an inner object called manager. |
|
Which in turn contains an inner object called name. |
Internally, this document is indexed as a simple, flat list of key-value pairs, something like this:
{
"region": "US",
"manager.age": 30,
"manager.name.first": "John",
"manager.name.last": "Smith"
}
An explicit mapping for the above document could look like this:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"region": {
"type": "string",
"index": "not_analyzed"
},
"manager": {
"properties": {
"age": { "type": "integer" },
"name": {
"properties": {
"first": { "type": "string" },
"last": { "type": "string" }
}
}
}
}
}
}
}
}
The mapping type is a type of object, and has a properties field. |
|
The manager field is an inner object field. |
|
The manager.name field is an inner object field within the manager field. |
You are not required to set the field type to object explicitly, as this is the default value.
118.10.1. Parameters for object fields
The following parameters are accepted by object fields:
dynamic
|
Whether or not new properties should be added dynamically to an
existing object. Accepts true (default), false and strict. |
enabled
|
Whether the JSON value given for the object field should be
parsed and indexed (true, default) or completely ignored (false). |
include_in_all
|
Sets the default include_in_all value for all the properties
within the object. The object itself is not added to the _all field. |
properties
|
The fields within the object, which can be of any
datatype, including object. New properties may be added to an
existing object. |
|
|
If you need to index arrays of objects instead of single objects, read Nested datatype first. |
118.11. String datatype
Fields of type string accept text values. Strings may be sub-divided into:
- Full text
-
Full text values, like the body of an email, are typically used for text based relevance searches, such as: Find the most relevant documents that match a query for "quick brown fox".
These fields are
analyzed, that is they are passed through an analyzer to convert the string into a list of individual terms before being indexed. The analysis process allows Elasticsearch to search for individual words within each full text field. Full text fields are not used for sorting and seldom used for aggregations (although the significant terms aggregation is a notable exception).
- Keywords
-
Keywords are exact values like email addresses, hostnames, status codes, or tags. They are typically used for filtering (Find me all blog posts where
status is published), for sorting, and for aggregations. Keyword fields are not_analyzed. Instead, the exact string value is added to the index as a single term.
Below is an example of a mapping for a full text (analyzed) and a keyword
(not_analyzed) string field:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"full_name": {
"type": "string"
},
"status": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
The full_name field is an analyzed full text field — index:analyzed is the default. |
|
The status field is a not_analyzed keyword field. |
Sometimes it is useful to have both a full text (analyzed) and a keyword
(not_analyzed) version of the same field: one for full text search and the
other for aggregations and sorting. This can be achieved with
multi-fields.
118.11.1. Parameters for string fields
The following parameters are accepted by string fields:
analyzer
|
The analyzer which should be used for
analyzed string fields, both at index-time and at search-time (unless
overridden by the search_analyzer). Defaults to the default index
analyzer, or the standard analyzer. |
boost
|
Field-level index time boosting. Accepts a floating point number, defaults
to 1.0. |
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts true
or false. Defaults to true for not_analyzed fields. Analyzed fields
do not support doc values. |
fielddata
|
Can the field use in-memory fielddata for sorting, aggregations,
or scripting? Accepts disabled or paged_bytes (default).
Not analyzed fields will use doc values in preference to fielddata. |
fields
|
Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations, or the same string value analyzed by different analyzers. |
ignore_above
|
Do not index or analyze any string longer than this value. Defaults to 0 (disabled). |
include_in_all
|
Whether or not the field value should be included in the
_all field. Accepts true or false. Defaults to false if index
is set to no, or if a parent object field sets include_in_all to
false. Otherwise defaults to true. |
index
|
Should the field be searchable? Accepts analyzed (default, treat as
full-text field), not_analyzed (treat as keyword field) and no. |
index_options
|
What information should be stored in the index, for search and highlighting purposes.
Defaults to positions for analyzed fields, and to docs for
not_analyzed fields. |
norms
|
Whether field-length should be taken into account when scoring queries.
Defaults depend on the index setting: enabled for analyzed fields,
and disabled for not_analyzed fields. |
null_value
|
Accepts a string value which is substituted for any explicit null
values. Defaults to null, which means the field is treated as missing.
If the field is analyzed, the null_value will also be analyzed. |
position_increment_gap
|
The number of fake term positions which should be inserted between each element of an array of strings. Defaults to the position_increment_gap configured on the analyzer, which defaults to 100. 100 was chosen because it prevents phrase queries with reasonably large slops (less than 100) from matching terms across field values. |
store
|
Whether the field value should be stored and retrievable separately from
the _source field. Accepts true or false (default). |
search_analyzer
|
The analyzer that should be used at search time on
analyzed fields. Defaults to the analyzer setting. |
search_quote_analyzer
|
The analyzer that should be used at search time when a
phrase is encountered. Defaults to the search_analyzer setting. |
similarity
|
Which scoring algorithm or similarity should be used. Defaults
to default, which uses TF/IDF. |
term_vector
|
Whether term vectors should be stored for an analyzed field.
Defaults to no. |
118.12. Token count datatype
A field of type token_count is really an integer field which
accepts string values, analyzes them, then indexes the number of tokens in the
string.
For instance:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"name": {
"type": "string",
"fields": {
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
}
}
}
}
}
PUT my_index/my_type/1
{ "name": "John Smith" }
PUT my_index/my_type/2
{ "name": "Rachel Alice Williams" }
GET my_index/_search
{
"query": {
"term": {
"name.length": 3
}
}
}
The name field is an analyzed string field which uses the default standard analyzer. |
|
The name.length field is a token_count multi-field which will index the number of tokens in the name field. |
|
This query matches only the document containing Rachel Alice Williams, as it contains three tokens. |
|
|
Technically the token_count type sums position increments rather than
counting tokens. This means that even if the analyzer filters out stop
words, they are included in the count. |
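The idea can be sketched with a deliberately naive whitespace tokenizer (illustrative Python; the real standard analyzer does much more):

```python
# Naive sketch: a token_count field indexes the number of tokens the
# analyzer produces, not the text itself. Splitting on whitespace
# stands in for a real analyzer here.
def token_count(text):
    return len(text.split())

print(token_count("John Smith"))             # 2
print(token_count("Rachel Alice Williams"))  # 3
```

A term query for name.length: 3 therefore matches only the second document.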
118.12.1. Parameters for token_count fields
The following parameters are accepted by token_count fields:
analyzer
|
The analyzer which should be used to analyze the string value. Required. For best performance, use an analyzer without token filters. |
boost
|
Field-level index time boosting. Accepts a floating point number, defaults
to 1.0. |
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts true
(default) or false. |
index
|
Should the field be searchable? Accepts not_analyzed (default) and no. |
include_in_all
|
Whether or not the field value should be included in the
_all field. Accepts true or false. Defaults to false. Note: if
true, it is the string value that is added to _all, not the
calculated token count. |
null_value
|
Accepts a numeric value of the same type as the field, which is
substituted for any explicit null values. Defaults to null, which
means the field is treated as missing. |
precision_step
|
Controls the number of extra terms that are indexed to make
range queries faster. Defaults to 16. |
store
|
Whether the field value should be stored and retrievable separately from
the _source field. Accepts true or false (default). |
119. Meta-Fields
Each document has metadata associated with it, such as the _index, mapping
_type, and _id meta-fields. The behaviour of some of these meta-fields
can be customised when a mapping type is created.
Identity meta-fields
_index
|
The index to which the document belongs. |
_uid
|
A composite field consisting of the |
_type
|
The document’s mapping type. |
_id
|
The document’s ID. |
Document source meta-fields
_source-
The original JSON representing the body of the document.
_size-
The size of the
_sourcefield in bytes, provided by themapper-sizeplugin.
Indexing meta-fields
_all-
A catch-all field that indexes the values of all other fields.
_field_names-
All fields in the document which contain non-null values.
_timestamp-
A timestamp associated with the document, either specified manually or auto-generated.
_ttl-
How long a document should live before it is automatically deleted.
Routing meta-fields
_parent-
Used to create a parent-child relationship between two mapping types.
_routing-
A custom routing value which routes a document to a particular shard.
Other meta-field
_meta-
Application specific metadata.
119.1. _all field
The _all field is a special catch-all field which concatenates the values
of all of the other fields into one big string, using space as a delimiter, which is then
analyzed and indexed, but not stored. This means that it can be
searched, but not retrieved.
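Conceptually, the _all value is built something like this (illustrative Python, not the actual implementation):

```python
# Sketch of how _all is built: the original field *values* are
# concatenated into one big space-delimited string, which is then
# analyzed and indexed.
doc = {
    "first_name": "John",
    "last_name": "Smith",
    "date_of_birth": "1970-10-24",
}
all_value = " ".join(str(v) for v in doc.values())
print(all_value)  # John Smith 1970-10-24
```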
The _all field allows you to search for values in documents without knowing
which field contains the value. This makes it a useful option when getting
started with a new dataset. For instance:
PUT my_index/user/1
{
"first_name": "John",
"last_name": "Smith",
"date_of_birth": "1970-10-24"
}
GET my_index/_search
{
"query": {
"match": {
"_all": "john smith 1970"
}
}
}
The _all field will contain the terms: [ "john", "smith", "1970", "10", "24" ] |
|
|
All values treated as strings
The date_of_birth field in our example is recognised as a date field
and so will index a single term representing 1970-10-24 00:00:00 UTC.
The _all field, however, treats all values as strings, so the date
value is indexed as the three string terms "1970", "10" and "24".
It is important to note that the _all field combines the original
values from each field as a string. It does not combine the terms
from each field. |
The _all field is just a string field, and accepts the same
parameters that other string fields accept, including analyzer,
term_vectors, index_options, and store.
The _all field can be useful, especially when exploring new data using
simple filtering. However, by concatenating field values into one big string,
the _all field loses the distinction between short fields (more relevant)
and long fields (less relevant). For use cases where search relevance is
important, it is better to query individual fields specifically.
The _all field is not free: it requires extra CPU cycles and uses more disk
space. If not needed, it can be completely disabled or
customised on a per-field basis.
119.1.1. Using the _all field in queries
The query_string and
simple_query_string queries query
the _all field by default, unless another field is specified:
GET _search
{
"query": {
"query_string": {
"query": "john smith 1970"
}
}
}
The same goes for the ?q= parameter in URI search
requests (which is rewritten to a query_string query internally):
GET _search?q=john+smith+1970
Other queries, such as the match and
term queries require you to specify
the _all field explicitly, as per the
first example.
119.1.2. Disabling the _all field
The _all field can be completely disabled per-type by setting enabled to
false:
PUT my_index
{
"mappings": {
"type_1": {
"properties": {...}
},
"type_2": {
"_all": {
"enabled": false
},
"properties": {...}
}
}
}
The _all field in type_1 is enabled. |
|
The _all field in type_2 is completely disabled. |
If the _all field is disabled, then URI search requests and the
query_string and simple_query_string queries will not be able to use it
for queries (see Using the _all field in queries). You can configure them to use a
different field with the index.query.default_field setting:
PUT my_index
{
"mappings": {
"my_type": {
"_all": {
"enabled": false
},
"properties": {
"content": {
"type": "string"
}
}
}
},
"settings": {
"index.query.default_field": "content"
}
}
The _all field is disabled for the my_type type. |
|
The query_string query will default to querying the content field in this index. |
119.1.3. Excluding fields from _all
Individual fields can be included or excluded from the _all field with the
include_in_all setting.
119.1.4. Index boosting and the _all field
Individual fields can be boosted at index time, with the boost
parameter. The _all field takes these boosts into account:
PUT myindex
{
"mappings": {
"mytype": {
"properties": {
"title": {
"type": "string",
"boost": 2
},
"content": {
"type": "string"
}
}
}
}
}
When querying the _all field, words that originated in the
title field are twice as relevant as words that originated in
the content field. |
|
|
Using index-time boosting with the _all field has a significant
impact on query performance. Usually the better solution is to query fields
individually, with optional query time boosting.
|
119.1.5. Custom _all fields
While there is only a single _all field per index, the copy_to
parameter allows the creation of multiple custom _all fields. For
instance, first_name and last_name fields can be combined together into
the full_name field:
PUT myindex
{
"mappings": {
"mytype": {
"properties": {
"first_name": {
"type": "string",
"copy_to": "full_name"
},
"last_name": {
"type": "string",
"copy_to": "full_name"
},
"full_name": {
"type": "string"
}
}
}
}
}
PUT myindex/mytype/1
{
"first_name": "John",
"last_name": "Smith"
}
GET myindex/_search
{
"query": {
"match": {
"full_name": "John Smith"
}
}
}
The first_name and last_name values are copied to the full_name field. |
119.1.6. Highlighting and the _all field
A field can only be used for highlighting if
the original string value is available, either from the
_source field or as a stored field.
The _all field is not present in the _source field and it is not stored by
default, and so cannot be highlighted. There are two options. Either
store the _all field or highlight the
original fields.
Store the _all field
If store is set to true, then the original field value is retrievable and
can be highlighted:
PUT myindex
{
"mappings": {
"mytype": {
"_all": {
"store": true
}
}
}
}
PUT myindex/mytype/1
{
"first_name": "John",
"last_name": "Smith"
}
GET _search
{
"query": {
"match": {
"_all": "John Smith"
}
},
"highlight": {
"fields": {
"_all": {}
}
}
}
Of course, storing the _all field will use significantly more disk space
and, because it is a combination of other fields, it may result in odd
highlighting results.
The _all field also accepts the term_vector and index_options
parameters, allowing the use of the fast vector highlighter and the postings
highlighter.
Highlight original fields
You can query the _all field, but use the original fields for highlighting as follows:
PUT myindex
{
"mappings": {
"mytype": {
"_all": {}
}
}
}
PUT myindex/mytype/1
{
"first_name": "John",
"last_name": "Smith"
}
GET _search
{
"query": {
"match": {
"_all": "John Smith"
}
},
"highlight": {
"fields": {
"*_name": {
"require_field_match": "false"
}
}
}
}
The query inspects the _all field to find matching documents. |
|
Highlighting is performed on the two name fields, which are available from the _source. |
|
The query wasn’t run against the name fields, so set require_field_match to false. |
119.2. _field_names field
The _field_names field indexes the names of every field in a document that
contains any value other than null. This field is used by the
exists and missing
queries to find documents that either have or don’t have any non-null value
for a particular field.
The value of the _field_names field is accessible in queries, aggregations, and
scripts:
# Example documents
PUT my_index/my_type/1
{
"title": "This is a document"
}
PUT my_index/my_type/2
{
"title": "This is another document",
"body": "This document has a body"
}
GET my_index/_search
{
"query": {
"terms": {
"_field_names": [ "title" ]
}
},
"aggs": {
"Field names": {
"terms": {
"field": "_field_names",
"size": 10
}
}
},
"script_fields": {
"Field names": {
"script": "doc['_field_names']"
}
}
}
119.3. _id field
Each document indexed is associated with a _type (see
Mapping Types) and an _id. The _id field is not
indexed as its value can be derived automatically from the
_uid field.
The value of the _id field is accessible in queries and scripts, but not
in aggregations or when sorting, where the _uid field
should be used instead:
# Example documents
PUT my_index/my_type/1
{
"text": "Document with ID 1"
}
PUT my_index/my_type/2
{
"text": "Document with ID 2"
}
GET my_index/_search
{
"query": {
"terms": {
"_id": [ "1", "2" ]
}
},
"script_fields": {
"UID": {
"script": "doc['_id']"
}
}
}
119.4. _index field
When performing queries across multiple indexes, it is sometimes desirable to
add query clauses that are associated with documents of only certain indexes.
The _index field allows matching on the index a document was indexed into.
Its value is accessible in term, or terms queries, aggregations,
scripts, and when sorting:
|
|
The _index is exposed as a virtual field — it is not added to the
Lucene index as a real field. This means that you can use the _index field
in a term or terms query (or any query that is rewritten to a term
query, such as the match, query_string or simple_query_string query),
but it does not support prefix, wildcard, regexp, or fuzzy queries.
|
# Example documents
PUT index_1/my_type/1
{
"text": "Document in index 1"
}
PUT index_2/my_type/2
{
"text": "Document in index 2"
}
GET index_1,index_2/_search
{
"query": {
"terms": {
"_index": ["index_1", "index_2"]
}
},
"aggs": {
"indices": {
"terms": {
"field": "_index",
"size": 10
}
}
},
"sort": [
{
"_index": {
"order": "asc"
}
}
],
"script_fields": {
"index_name": {
"script": "doc['_index']"
}
}
}
Querying on the _index field |
|
Aggregating on the _index field |
|
Sorting on the _index field |
|
Accessing the _index field in scripts (inline scripts must be enabled for this example to work) |
119.5. _meta field
Each mapping type can have custom meta data associated with it. These are not used at all by Elasticsearch, but can be used to store application-specific metadata, such as the class that a document belongs to:
PUT my_index
{
"mappings": {
"user": {
"_meta": {
"class": "MyApp::User",
"version": {
"min": "1.0",
"max": "1.3"
}
}
}
}
}
This _meta info can be retrieved with the
GET mapping API. |
The _meta field can be updated on an existing type using the
PUT mapping API.
119.6. _parent field
A parent-child relationship can be established between documents in the same index by making one mapping type the parent of another:
PUT my_index
{
"mappings": {
"my_parent": {},
"my_child": {
"_parent": {
"type": "my_parent"
}
}
}
}
PUT my_index/my_parent/1
{
"text": "This is a parent document"
}
PUT my_index/my_child/2?parent=1
{
"text": "This is a child document"
}
PUT my_index/my_child/3?parent=1
{
"text": "This is another child document"
}
GET my_index/my_parent/_search
{
"query": {
"has_child": {
"type": "my_child",
"query": {
"match": {
"text": "child document"
}
}
}
}
}
The my_parent type is parent to the my_child type. |
|
| Index a parent document. | |
| Index two child documents, specifying the parent document’s ID. | |
| Find all parent documents that have children which match the query. |
See the has_child and
has_parent queries,
the children aggregation,
and inner hits for more information.
The value of the _parent field is accessible in queries, aggregations, scripts,
and when sorting:
GET my_index/_search
{
"query": {
"terms": {
"_parent": [ "1" ]
}
},
"aggs": {
"parents": {
"terms": {
"field": "_parent",
"size": 10
}
}
},
"sort": [
{
"_parent": {
"order": "desc"
}
}
],
"script_fields": {
"parent": {
"script": "doc['_parent']"
}
}
}
Querying on the _parent field (also see the has_parent query and the has_child query) |
|
Aggregating on the _parent field (also see the children aggregation) |
|
Sorting on the _parent field |
|
Accessing the _parent field in scripts (inline scripts must be enabled for this example to work) |
119.6.1. Parent-child restrictions
-
The parent and child types must be different — parent-child relationships cannot be established between documents of the same type.
-
The
_parent.type setting can only point to a type that doesn’t exist yet. This means that a type cannot become a parent type after it has been created.
-
Parent and child documents must be indexed on the same shard. The
parent ID is used as the routing value for the child, to ensure that the child is indexed on the same shard as the parent. This means that the same parent value needs to be provided when getting, deleting, or updating a child document.
119.6.2. Global ordinals
Parent-child uses global ordinals to speed up joins.
Global ordinals need to be rebuilt after any change to a shard. The more
parent id values are stored in a shard, the longer it takes to rebuild the
global ordinals for the _parent field.
Global ordinals, by default, are built lazily: the first parent-child query or
aggregation after a refresh will trigger building of global ordinals. This can
introduce a significant latency spike for your users. You can use
eager_global_ordinals to shift the cost of building global
ordinals from query time to refresh time, by mapping the _parent field as follows:
PUT my_index
{
"mappings": {
"my_parent": {},
"my_child": {
"_parent": {
"type": "my_parent",
"fielddata": {
"loading": "eager_global_ordinals"
}
}
}
}
}
The amount of heap used by global ordinals can be checked as follows:
# Per-index
GET _stats/fielddata?human&fields=_parent
# Per-node per-index
GET _nodes/stats/indices/fielddata?human&fields=_parent
119.7. _routing field
A document is routed to a particular shard in an index using the following formula:
shard_num = hash(_routing) % num_primary_shards
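The formula can be sketched as follows. Note this is illustrative only: Elasticsearch uses a Murmur3 hash, which is approximated here with Python's zlib.crc32 purely to keep the example self-contained:

```python
import zlib

# Sketch of shard routing: a stable hash of the routing value modulo
# the number of primary shards. Elasticsearch actually uses Murmur3;
# CRC32 stands in for it here.
def shard_num(routing_value, num_primary_shards):
    return zlib.crc32(routing_value.encode()) % num_primary_shards

# All documents with the same routing value land on the same shard,
# which is what makes routed searches possible.
assert shard_num("user1", 5) == shard_num("user1", 5)
print(shard_num("user1", 5))
```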
Custom routing patterns can be implemented by specifying a custom routing
value per document. For instance:
PUT my_index/my_type/1?routing=user1
{
"title": "This is a document"
}
GET my_index/my_type/1?routing=user1 
This document uses user1 as its routing value, instead of its ID. |
|
The same routing value needs to be provided when
getting, deleting, or updating
the document. |
The value of the _routing field is accessible in queries, aggregations, scripts,
and when sorting:
GET my_index/_search
{
"query": {
"terms": {
"_routing": [ "user1" ]
}
},
"aggs": {
"Routing values": {
"terms": {
"field": "_routing",
"size": 10
}
}
},
"sort": [
{
"_routing": {
"order": "desc"
}
}
],
"script_fields": {
"Routing value": {
"script": "doc['_routing']"
}
}
}
Querying on the _routing field (also see the ids query) |
|
Aggregating on the _routing field |
|
Sorting on the _routing field |
|
Accessing the _routing field in scripts (inline scripts must be enabled for this example to work) |
119.7.1. Searching with custom routing
Custom routing can reduce the impact of searches. Instead of having to fan out a search request to all the shards in an index, the request can be sent to just the shard that matches the specific routing value (or values):
GET my_index/_search?routing=user1,user2
{
"query": {
"match": {
"title": "document"
}
}
}
This search request will only be executed on the shards associated with the user1 and user2 routing values. |
119.7.2. Making a routing value required
When using custom routing, it is important to provide the routing value whenever indexing, getting, deleting, or updating a document.
Forgetting the routing value can lead to a document being indexed on more than
one shard. As a safeguard, the _routing field can be configured to make a
custom routing value required for all CRUD operations:
PUT my_index
{
"mappings": {
"my_type": {
"_routing": {
"required": true
}
}
}
}
PUT my_index/my_type/1
{
"text": "No routing value provided"
}
Routing is required for my_type documents. |
|
This index request throws a routing_missing_exception. |
119.7.3. Unique IDs with custom routing
When indexing documents specifying a custom _routing, the uniqueness of the
_id is not guaranteed across all of the shards in the index. In fact,
documents with the same _id might end up on different shards if indexed with
different _routing values.
It is up to the user to ensure that IDs are unique across the index.
119.8. _source field
The _source field contains the original JSON document body that was passed
at index time. The _source field itself is not indexed (and thus is not
searchable), but it is stored so that it can be returned when executing
fetch requests, like get or search.
119.8.1. Disabling the _source field
Though very handy to have around, the source field does incur storage overhead within the index. For this reason, it can be disabled as follows:
PUT tweets
{
"mappings": {
"tweet": {
"_source": {
"enabled": false
}
}
}
}
|
|
Think before disabling the _source field
Users often disable the _source field without thinking about the
consequences, and then live to regret it. If the _source field is
unavailable, a number of features are not supported, including the
update API, on-the-fly highlighting, the ability to reindex from one
Elasticsearch index or mapping type to another, and the ability to
debug queries or aggregations by viewing the original document.
|
|
|
If disk space is a concern, increase the
compression level instead of disabling the _source.
|
119.8.2. Including / Excluding fields from _source
An expert-only feature is the ability to prune the contents of the _source
field after the document has been indexed, but before the _source field is
stored.
|
|
Removing fields from the _source has similar downsides to disabling
_source, especially the fact that you cannot reindex documents from one
Elasticsearch index to another. Consider using
source filtering instead.
|
The includes/excludes parameters (which also accept wildcards) can be used
as follows:
PUT logs
{
"mappings": {
"event": {
"_source": {
"includes": [
"*.count",
"meta.*"
],
"excludes": [
"meta.description",
"meta.other.*"
]
}
}
}
}
PUT logs/event/1
{
"requests": {
"count": 10,
"foo": "bar"
},
"meta": {
"name": "Some metric",
"description": "Some metric description",
"other": {
"foo": "one",
"baz": "two"
}
}
}
GET logs/event/_search
{
"query": {
"match": {
"meta.other.foo": "one"
}
}
}
These fields will be removed from the stored _source field. |
|
We can still search on this field, even though it is not in the stored _source. |
119.9. _timestamp field
Deprecated in 2.0.0-beta2: the _timestamp field is deprecated. Instead, use a normal date field and set its value explicitly.
The _timestamp field, when enabled, allows a timestamp to be indexed and
stored with a document. The timestamp may be specified manually, generated
automatically, or set to a default value:
PUT my_index
{
"mappings": {
"my_type": {
"_timestamp": {
"enabled": true
}
}
}
}
PUT my_index/my_type/1?timestamp=2015-01-01
{ "text": "Timestamp as a formatted date" }
PUT my_index/my_type/2?timestamp=1420070400000
{ "text": "Timestamp as milliseconds since the epoch" }
PUT my_index/my_type/3
{ "text": "Autogenerated timestamp set to now()" }
Enable the _timestamp field with default settings. |
|
| Set the timestamp manually with a formatted date. | |
| Set the timestamp with milliseconds since the epoch. | |
| Auto-generates a timestamp with now(). |
The behaviour of the _timestamp field can be configured with the following parameters:
default-
A default value to be used if none is provided. Defaults to now().
format-
The date format (or formats) to use when parsing timestamps. Defaults to
epoch_millis||strictDateOptionalTime.
ignore_missing-
If
true (default), replace missing timestamps with the default value. If false, throw an exception.
The value of the _timestamp field is accessible in queries, aggregations, scripts,
and when sorting:
GET my_index/_search
{
"query": {
"range": {
"_timestamp": {
"gte": "2015-01-01"
}
}
},
"aggs": {
"Timestamps": {
"terms": {
"field": "_timestamp",
"size": 10
}
}
},
"sort": [
{
"_timestamp": {
"order": "desc"
}
}
],
"script_fields": {
"Timestamp": {
"script": "doc['_timestamp']"
}
}
}
Querying on the _timestamp field |
|
Aggregating on the _timestamp field |
|
Sorting on the _timestamp field |
|
Accessing the _timestamp field in scripts (inline scripts must be enabled for this example to work) |
119.10. _ttl field
Deprecated in 2.0.0-beta2: the current _ttl implementation is deprecated and will be replaced with a different implementation in a future version.
Some types of documents, such as session data or special offers, come with an
expiration date. The _ttl field allows you to specify the minimum time a
document should live, after which time the document is deleted automatically.
|
|
Prefer index-per-timeframe to TTL
With TTL, expired documents first have to be marked as deleted and later purged from the index when segments are merged. For append-only time-based data such as log events, it is much more efficient to use an index-per-day / week / month instead of TTLs. Old log data can be removed by simply deleting old indices. |
The _ttl field may be enabled as follows:
PUT my_index
{
"mappings": {
"my_type": {
"_ttl": {
"enabled": true
}
}
}
}
PUT my_index/my_type/1?ttl=10m
{
"text": "Will expire in 10 minutes"
}
PUT my_index/my_type/2
{
"text": "Will not expire"
}
| This document will expire 10 minutes after being indexed. | |
| This document has no TTL set and will not expire. |
The expiry time is calculated as the value of the
_timestamp field (or now() if the _timestamp
is not enabled) plus the ttl specified in the indexing request.
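The calculation itself is simple addition (illustrative Python):

```python
from datetime import datetime, timedelta

# Sketch of the expiry calculation: the indexing timestamp (or now())
# plus the ttl from the request. Values are illustrative.
indexed_at = datetime(2015, 1, 1, 12, 0, 0)  # _timestamp, or now()
ttl = timedelta(minutes=10)                   # ?ttl=10m
expires_at = indexed_at + ttl
print(expires_at)  # 2015-01-01 12:10:00
```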
119.10.1. Default TTL
You can provide a default _ttl, which will be applied to indexing requests where the ttl is not specified:
PUT my_index
{
"mappings": {
"my_type": {
"_ttl": {
"enabled": true,
"default": "5m"
}
}
}
}
PUT my_index/my_type/1?ttl=10m
{
"text": "Will expire in 10 minutes"
}
PUT my_index/my_type/2
{
"text": "Will expire in 5 minutes"
}
| This document will expire 10 minutes after being indexed. | |
| This document has no TTL set and so will expire after the default 5 minutes. |
The default value can use time units like d for days, and
will use ms as the default unit if no time unit is provided.
You can dynamically update the default value using the put mapping
API. It won’t change the _ttl of already indexed documents but will be
used for future documents.
119.10.2. Note on documents expiration
Expired documents will be automatically deleted periodically. The following settings control the expiry process:
indices.ttl.interval-
How often the purge process should run. Defaults to
60s. Expired documents may still be retrieved before they are purged.
indices.ttl.bulk_size-
How many deletions are handled by a single
bulk request. The default value is 10000.
119.11. _type field
Each document indexed is associated with a _type (see
Mapping Types) and an _id. The _type field is
indexed in order to make searching by type name fast.
The value of the _type field is accessible in queries, aggregations,
scripts, and when sorting:
# Example documents
PUT my_index/type_1/1
{
"text": "Document with type 1"
}
PUT my_index/type_2/2
{
"text": "Document with type 2"
}
GET my_index/type_*/_search
{
"query": {
"terms": {
"_type": [ "type_1", "type_2" ]
}
},
"aggs": {
"types": {
"terms": {
"field": "_type",
"size": 10
}
}
},
"sort": [
{
"_type": {
"order": "desc"
}
}
],
"script_fields": {
"type": {
"script": "doc['_type']"
}
}
}
Querying on the _type field |
|
Aggregating on the _type field |
|
Sorting on the _type field |
|
Accessing the _type field in scripts (inline scripts must be enabled for this example to work) |
119.12. _uid field
Each document indexed is associated with a _type (see
Mapping Types) and an _id. These values are
combined as {type}#{id} and indexed as the _uid field.
The value of the _uid field is accessible in queries, aggregations, scripts,
and when sorting:
# Example documents
PUT my_index/my_type/1
{
"text": "Document with ID 1"
}
PUT my_index/my_type/2
{
"text": "Document with ID 2"
}
GET my_index/_search
{
"query": {
"terms": {
"_uid": [ "my_type#1", "my_type#2" ]
}
},
"aggs": {
"UIDs": {
"terms": {
"field": "_uid",
"size": 10
}
}
},
"sort": [
{
"_uid": {
"order": "desc"
}
}
],
"script_fields": {
"UID": {
"script": "doc['_uid']"
}
}
}
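The {type}#{id} composition is simple enough to sketch directly (make_uid and split_uid are hypothetical helper names, not client API):

```python
def make_uid(doc_type, doc_id):
    # The _uid is simply the type and the id joined with '#'.
    return "{}#{}".format(doc_type, doc_id)

def split_uid(uid):
    # Split on the first '#' only, so ids containing '#' survive.
    doc_type, _, doc_id = uid.partition("#")
    return doc_type, doc_id

print(make_uid("my_type", "1"))   # my_type#1
print(split_uid("my_type#1"))     # ('my_type', '1')
```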
120. Mapping parameters
The following pages provide detailed explanations of the various mapping parameters that are used by field mappings:
The following mapping parameters are common to some or all field datatypes:
120.1. analyzer
The values of analyzed string fields are passed through an
analyzer to convert the string into a stream of tokens or
terms. For instance, the string "The quick Brown Foxes." may, depending
on which analyzer is used, be analyzed to the tokens: quick, brown,
fox. These are the actual terms that are indexed for the field, which makes
it possible to search efficiently for individual words within big blobs of
text.
This analysis process needs to happen not just at index time, but also at query time: the query string needs to be passed through the same (or a similar) analyzer so that the terms that it tries to find are in the same format as those that exist in the index.
Elasticsearch ships with a number of pre-defined analyzers, which can be used without further configuration. It also ships with many character filters, tokenizers, and Token Filters which can be combined to configure custom analyzers per index.
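As a rough illustration of what an analyzer does, here is a toy Python stand-in. Real analyzers use configurable tokenizers, token filters, and proper stemming algorithms; the crude suffix-stripping below is only a sketch:

```python
import re

STOPWORDS = {"the", "a", "an", "of"}

def toy_english_analyzer(text):
    """A toy stand-in for an English analyzer: tokenize on letters,
    lowercase, drop stop words, and crudely strip plural suffixes."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t[:-2] if t.endswith("es") else t[:-1] if t.endswith("s") else t
            for t in tokens]

print(toy_english_analyzer("The quick Brown Foxes."))  # ['quick', 'brown', 'fox']
```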
Analyzers can be specified per-query, per-field or per-index. At index time, Elasticsearch will look for an analyzer in this order:
- The analyzer defined in the field mapping.
- An analyzer named default in the index settings.
- The standard analyzer.
At query time, there are a few more layers:
- The analyzer defined in a full-text query.
- The search_analyzer defined in the field mapping.
- The analyzer defined in the field mapping.
- An analyzer named default_search in the index settings.
- An analyzer named default in the index settings.
- The standard analyzer.
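The two fallback chains can be sketched as plain dictionary lookups (the mappings and settings here are simplified dicts, not real Elasticsearch objects):

```python
def resolve_index_analyzer(field_mapping, index_settings):
    """Index-time fallback chain: field mapping, then the index's
    'default' analyzer, then 'standard'."""
    return (field_mapping.get("analyzer")
            or index_settings.get("default")
            or "standard")

def resolve_search_analyzer(query, field_mapping, index_settings):
    """Query-time chain adds the query's own analyzer, the field's
    search_analyzer, and the index's 'default_search' analyzer."""
    return (query.get("analyzer")
            or field_mapping.get("search_analyzer")
            or field_mapping.get("analyzer")
            or index_settings.get("default_search")
            or index_settings.get("default")
            or "standard")

print(resolve_index_analyzer({}, {}))                          # standard
print(resolve_search_analyzer({}, {"analyzer": "english"}, {}))  # english
```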
The easiest way to specify an analyzer for a particular field is to define it in the field mapping, as follows:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "string",
"fields": {
"english": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
}
}
GET my_index/_analyze?field=text
{
"text": "The quick Brown Foxes."
}
GET my_index/_analyze?field=text.english
{
"text": "The quick Brown Foxes."
}
The text field uses the default standard analyzer. |
|
The text.english multi-field uses the english analyzer, which removes stop words and applies stemming. |
|
This returns the tokens: [ the, quick, brown, foxes ]. |
|
This returns the tokens: [ quick, brown, fox ]. |
120.1.1. search_quote_analyzer
The search_quote_analyzer setting allows you to specify an analyzer for phrases. This is particularly useful when you want to disable
stop words for phrase queries.
To disable stop words for phrases, a field needs to use three analyzer settings:
- An analyzer setting for indexing all terms including stop words
- A search_analyzer setting for non-phrase queries that will remove stop words
- A search_quote_analyzer setting for phrase queries that will not remove stop words
PUT /my_index
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":[
"lowercase"
]
},
"my_stop_analyzer":{
"type":"custom",
"tokenizer":"standard",
"filter":[
"lowercase",
"english_stop"
]
}
},
"filter":{
"english_stop":{
"type":"stop",
"stopwords":"_english_"
}
}
}
},
"mappings":{
"my_type":{
"properties":{
"title": {
"type":"string",
"analyzer":"my_analyzer",
"search_analyzer":"my_stop_analyzer",
"search_quote_analyzer":"my_analyzer"
}
}
}
}
}
}
PUT my_index/my_type/1
{
"title":"The Quick Brown Fox"
}
PUT my_index/my_type/2
{
"title":"A Quick Brown Fox"
}
GET my_index/my_type/_search
{
"query":{
"query_string":{
"query":"\"the quick brown fox\""
}
}
}
my_analyzer analyzer which tokenizes all terms including stop words |
|
my_stop_analyzer analyzer which removes stop words |
|
analyzer setting that points to the my_analyzer analyzer which will be used at index time |
|
search_analyzer setting that points to the my_stop_analyzer and removes stop words for non-phrase queries |
|
search_quote_analyzer setting that points to the my_analyzer analyzer and ensures that stop words are not removed from phrase queries |
|
Since the query is wrapped in quotes it is detected as a phrase query, so the search_quote_analyzer kicks in and ensures that the stop words
are not removed from the query. The my_analyzer analyzer will then return the tokens [the, quick, brown, fox], which will match one
of the documents. Meanwhile, term queries will be analyzed with the my_stop_analyzer analyzer, which filters out stop words. So a search for either
The quick brown fox or A quick brown fox will return both documents, since both documents contain the tokens [quick, brown, fox].
Without the search_quote_analyzer it would not be possible to do exact matches for phrase queries, as the stop words from phrase queries would be
removed, resulting in both documents matching. |
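The phrase-vs-term behaviour can be simulated with a few lines of Python (a toy model: quoted input stands in for a phrase query, and stop-word removal stands in for the two analyzers):

```python
STOPWORDS = {"the", "a", "an"}

def analyze(text, remove_stopwords):
    tokens = text.lower().replace('"', "").split()
    return [t for t in tokens if not (remove_stopwords and t in STOPWORDS)]

def analyze_query(query):
    """Phrase queries (quoted) keep stop words, standing in for the
    search_quote_analyzer; everything else goes through the
    stop-word-removing search_analyzer."""
    is_phrase = query.startswith('"') and query.endswith('"')
    return analyze(query, remove_stopwords=not is_phrase)

print(analyze_query('"the quick brown fox"'))  # ['the', 'quick', 'brown', 'fox']
print(analyze_query('The quick brown fox'))    # ['quick', 'brown', 'fox']
```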
120.2. boost
Individual fields can be boosted (made to count more towards the relevance score) at index time, with the boost parameter as follows:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"title": {
"type": "string",
"boost": 2
},
"content": {
"type": "string"
}
}
}
}
}
Matches on the title field will have twice the weight as those on the
content field, which has the default boost of 1.0. |
Note that a title field will usually be shorter than a content field. The
default relevance calculation takes field length into account, so a short
title field will have a higher natural boost than a long content field.
|
|
Why index time boosting is a bad idea
We advise against using index time boosting for the following reasons:
- You cannot change index-time boost values without reindexing all of your documents.
- Every query supports query-time boosting which achieves the same effect. The difference is that you can tweak the boost value without having to reindex.
- Index-time boosts are stored as part of the norm, which is only one byte. This reduces the resolution of the field length normalization factor, which can lead to lower quality relevance calculations.
|
The only advantage that index time boosting has is that it is copied with the
value into the _all field. This means that, when
querying the _all field, words that originated from the title field will
have a higher score than words that originated in the content field.
This functionality comes at a cost: queries on the _all field are slower
when index-time boosting is used.
120.3. coerce
Data is not always clean. Depending on how it is produced a number might be
rendered in the JSON body as a true JSON number, e.g. 5, but it might also
be rendered as a string, e.g. "5". Alternatively, a number that should be
an integer might instead be rendered as a floating point, e.g. 5.0, or even
"5.0".
Coercion attempts to clean up dirty values to fit the datatype of a field. For instance:
- Strings will be coerced to numbers.
- Floating points will be truncated for integer values.
- Lon/lat geo-points will be normalized to a standard -180:180 / -90:90 coordinate system.
For instance:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"number_one": {
"type": "integer"
},
"number_two": {
"type": "integer",
"coerce": false
}
}
}
}
}
PUT my_index/my_type/1
{
"number_one": "10"
}
PUT my_index/my_type/2
{
"number_two": "10"
}
The number_one field will contain the integer 10. |
|
| This document will be rejected because coercion is disabled. |
|
|
The coerce setting is allowed to have different settings for fields of
the same name in the same index. Its value can be updated on existing fields
using the PUT mapping API.
|
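The coercion rules above can be sketched in Python. This is an illustration, not Elasticsearch's actual implementation; in particular the latitude handling is simplified (real geo-point normalization also adjusts longitude when latitude wraps over a pole):

```python
def coerce_integer(value):
    """Coerce a dirty value to an integer, as the coerce setting does:
    strings are parsed, and floating points are truncated."""
    if isinstance(value, str):
        value = float(value)
    return int(value)  # truncates 5.7 -> 5

def normalize_lon_lat(lon, lat):
    """Wrap longitude into -180..180 and clamp latitude into -90..90
    (simplified sketch of geo-point normalization)."""
    lon = ((lon + 180) % 360) - 180
    lat = max(-90.0, min(90.0, lat))
    return lon, lat

print(coerce_integer("10"))           # 10
print(coerce_integer("5.0"))          # 5
print(normalize_lon_lat(190, 45))     # (-170, 45)
```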
120.3.1. Index-level default
The index.mapping.coerce setting can be set on the index level to disable
coercion globally across all mapping types:
PUT my_index
{
"settings": {
"index.mapping.coerce": false
},
"mappings": {
"my_type": {
"properties": {
"number_one": {
"type": "integer"
},
"number_two": {
"type": "integer",
"coerce": true
}
}
}
}
}
PUT my_index/my_type/1
{ "number_one": "10" }
PUT my_index/my_type/2
{ "number_two": "10" } 
This document will be rejected because the number_one field inherits the index-level coercion setting. |
|
The number_two field overrides the index level setting to enable coercion. |
120.4. copy_to
The copy_to parameter allows you to create custom
_all fields. In other words, the values of multiple
fields can be copied into a group field, which can then be queried as a single
field. For instance, the first_name and last_name fields can be copied to
the full_name field as follows:
PUT /my_index
{
"mappings": {
"my_type": {
"properties": {
"first_name": {
"type": "string",
"copy_to": "full_name"
},
"last_name": {
"type": "string",
"copy_to": "full_name"
},
"full_name": {
"type": "string"
}
}
}
}
}
PUT /my_index/my_type/1
{
"first_name": "John",
"last_name": "Smith"
}
GET /my_index/_search
{
"query": {
"match": {
"full_name": {
"query": "John Smith",
"operator": "and"
}
}
}
}
The values of the first_name and last_name fields are copied to the
full_name field. |
|
The first_name and last_name fields can still be queried for the
first name and last name respectively, but the full_name field can be
queried for both first and last names. |
Some important points:
- It is the field value which is copied, not the terms (which result from the analysis process).
- The original _source field will not be modified to show the copied values.
- The same value can be copied to multiple fields, with "copy_to": [ "field_1", "field_2" ]
120.5. doc_values
Most fields are indexed by default, which makes them searchable. The inverted index allows queries to look up the search term in a unique sorted list of terms, and from that immediately have access to the list of documents that contain the term.
Sorting, aggregations, and access to field values in scripts requires a different data access pattern. Instead of looking up the term and finding documents, we need to be able to look up the document and find the terms that it has in a field.
Doc values are the on-disk data structure, built at document index time, which
makes this data access pattern possible. They store the same values as the
_source but in a column-oriented fashion that is way more efficient for
sorting and aggregations. Doc values are supported on almost all field types,
with the notable exception of analyzed string fields.
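The two access patterns can be contrasted with a small Python sketch, with plain dicts standing in for the on-disk structures:

```python
# Two views of the same data: an inverted index answers "which documents
# contain term T?", while doc values answer "what value does document D hold?".
docs = {1: "active", 2: "inactive", 3: "active"}

inverted_index = {}        # term -> list of doc ids (search access pattern)
for doc_id, term in docs.items():
    inverted_index.setdefault(term, []).append(doc_id)

doc_values = dict(docs)    # doc id -> value (sorting/aggregation access pattern)

print(inverted_index["active"])  # [1, 3]
print(doc_values[2])             # inactive
```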
All fields which support doc values have them enabled by default. If you are sure that you don’t need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"status_code": {
"type": "string",
"index": "not_analyzed"
},
"session_id": {
"type": "string",
"index": "not_analyzed",
"doc_values": false
}
}
}
}
}
The status_code field has doc_values enabled by default. |
|
The session_id has doc_values disabled, but can still be queried. |
|
|
The doc_values setting is allowed to have different settings for fields
of the same name in the same index. It can be disabled (set to false) on
existing fields using the PUT mapping API.
|
120.6. dynamic
By default, fields can be added dynamically to a document, or to inner objects within a document, just by indexing a document containing the new field. For instance:
DELETE my_index
PUT my_index/my_type/1
{
"username": "johnsmith",
"name": {
"first": "John",
"last": "Smith"
}
}
GET my_index/_mapping
PUT my_index/my_type/2
{
"username": "marywhite",
"email": "mary@white.com",
"name": {
"first": "Mary",
"middle": "Alice",
"last": "White"
}
}
GET my_index/_mapping 
| First delete the index, in case it already exists. | |
This document introduces the string field username, the object field
name, and two string fields under the name object which can be
referred to as name.first and name.last. |
|
| Check the mapping to verify the above. | |
This document adds two string fields: email and name.middle. |
|
| Check the mapping to verify the changes. |
The details of how new fields are detected and added to the mapping are explained in Dynamic Mapping.
The dynamic setting controls whether new fields can be added dynamically or
not. It accepts three settings:
- true: Newly detected fields are added to the mapping. (default)
- false: Newly detected fields are ignored. New fields must be added explicitly.
- strict: If new fields are detected, an exception is thrown and the document is rejected.
The dynamic setting may be set at the mapping type level, and on each
inner object. Inner objects inherit the setting from their parent
object or from the mapping type. For instance:
PUT my_index
{
"mappings": {
"my_type": {
"dynamic": false,
"properties": {
"user": {
"properties": {
"name": {
"type": "string"
},
"social_networks": {
"dynamic": true,
"properties": {}
}
}
}
}
}
}
}
| Dynamic mapping is disabled at the type level, so no new top-level fields will be added dynamically. | |
The user object inherits the type-level setting. |
|
The user.social_networks object enables dynamic mapping, so new fields may be added to this inner object. |
|
|
The dynamic setting is allowed to have different settings for fields of
the same name in the same index. Its value can be updated on existing fields
using the PUT mapping API.
|
120.7. enabled
Elasticsearch tries to index all of the fields you give it, but sometimes you want to just store the field without indexing it. For instance, imagine that you are using Elasticsearch as a web session store. You may want to index the session ID and last update time, but you don’t need to query or run aggregations on the session data itself.
The enabled setting, which can be applied only to the mapping type and to
object fields, causes Elasticsearch to skip parsing of the
contents of the field entirely. The JSON can still be retrieved from the
_source field, but it is not searchable or stored
in any other way:
PUT my_index
{
"mappings": {
"session": {
"properties": {
"user_id": {
"type": "string",
"index": "not_analyzed"
},
"last_updated": {
"type": "date"
},
"session_data": {
"enabled": false
}
}
}
}
}
PUT my_index/session/session_1
{
"user_id": "kimchy",
"session_data": {
"arbitrary_object": {
"some_array": [ "foo", "bar", { "baz": 2 } ]
}
},
"last_updated": "2015-12-06T18:20:22"
}
PUT my_index/session/session_2
{
"user_id": "jpountz",
"session_data": "none",
"last_updated": "2015-12-06T18:22:13"
}
The session_data field is disabled. |
|
Any arbitrary data can be passed to the session_data field as it will be entirely ignored. |
|
The session_data will also ignore values that are not JSON objects. |
The entire mapping type may be disabled as well, in which case the document is
stored in the _source field, which means it can be
retrieved, but none of its contents are indexed in any way:
PUT my_index
{
"mappings": {
"session": {
"enabled": false
}
}
}
PUT my_index/session/session_1
{
"user_id": "kimchy",
"session_data": {
"arbitrary_object": {
"some_array": [ "foo", "bar", { "baz": 2 } ]
}
},
"last_updated": "2015-12-06T18:20:22"
}
GET my_index/session/session_1
GET my_index/_mapping 
The entire session mapping type is disabled. |
|
| The document can be retrieved. | |
| Checking the mapping reveals that no fields have been added. |
|
|
The enabled setting is allowed to have different settings for fields of
the same name in the same index. Its value can be updated on existing fields
using the PUT mapping API.
|
120.8. fielddata
Most fields are indexed by default, which makes them searchable. The inverted index allows queries to look up the search term in a unique sorted list of terms, and from that immediately have access to the list of documents that contain the term.
Sorting, aggregations, and access to field values in scripts requires a different data access pattern. Instead of looking up the term and finding documents, we need to be able to look up the document and find the terms that it has in a field.
Most fields can use index-time, on-disk doc_values to support
this type of data access pattern, but analyzed string fields do not support
doc_values.
Instead, analyzed strings use a query-time data structure called
fielddata. This data structure is built on demand the first time that a
field is used for aggregations, sorting, or is accessed in a script. It is built
by reading the entire inverted index for each segment from disk, inverting the
term ↔︎ document relationship, and storing the result in memory, in the
JVM heap.
Loading fielddata is an expensive process so, once it has been loaded, it remains in memory for the lifetime of the segment.
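Conceptually, building fielddata amounts to un-inverting the index: turning term → documents into document → terms. A minimal sketch, with a dict standing in for a segment's inverted index:

```python
def build_fielddata(inverted_index):
    """Un-invert a term -> [doc ids] index into doc id -> [terms], the
    in-memory structure fielddata builds on first use (a sketch)."""
    fielddata = {}
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            fielddata.setdefault(doc_id, []).append(term)
    return {doc_id: sorted(terms) for doc_id, terms in fielddata.items()}

index = {"brown": [1, 2], "fox": [1], "quick": [2]}
print(build_fielddata(index))  # {1: ['brown', 'fox'], 2: ['brown', 'quick']}
```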
|
|
Fielddata can fill up your heap space
Fielddata can consume a lot of heap space, especially when loading high
cardinality analyzed string fields. |
|
|
The fielddata.* settings must have the same settings for fields of the
same name in the same index. Its value can be updated on existing fields
using the PUT mapping API.
|
120.8.1. fielddata.format
For analyzed string fields, the fielddata format controls whether
fielddata should be enabled or not. It accepts: disabled and paged_bytes
(enabled, which is the default). To disable fielddata loading, you can use
the following mapping:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "string",
"fielddata": {
"format": "disabled"
}
}
}
}
}
}
The text field cannot be used for sorting, aggregations, or in scripts. |
|
|
Fielddata and other datatypes
Historically, other field datatypes also used fielddata, but this has been replaced
by index-time, disk-based doc_values. |
120.8.2. fielddata.loading
This per-field setting controls when fielddata is loaded into memory. It accepts three options:
- lazy: Fielddata is only loaded into memory when it is needed. (default)
- eager: Fielddata is loaded into memory before a new search segment becomes visible to search. This can reduce the latency that a user may experience if their search request has to trigger lazy loading from a big segment.
- eager_global_ordinals: Loading fielddata into memory is only part of the work that is required. After loading the fielddata for each segment, Elasticsearch builds the Global ordinals data structure to make a list of all unique terms across all the segments in a shard. By default, global ordinals are built lazily. If the field has a very high cardinality, global ordinals may take some time to build, in which case you can use eager loading instead.
120.8.3. fielddata.filter
Fielddata filtering can be used to reduce the number of terms loaded into memory, and thus reduce memory usage. Terms can be filtered by frequency or by regular expression, or a combination of the two:
Filtering by frequency
The frequency filter allows you to only load terms whose term frequency falls between a min and max value, which can be expressed as an absolute number (when the number is bigger than 1.0) or as a percentage (eg 0.01 is 1% and 1.0 is 100%). Frequency is calculated per segment. Percentages are based on the number of docs which have a value for the field, as opposed to all docs in the segment.
Small segments can be excluded completely by specifying the minimum number of docs that the segment should contain with min_segment_size:
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tag": {
          "type": "string",
          "fielddata": {
            "filter": {
              "frequency": {
                "min": 0.001,
                "max": 0.1,
                "min_segment_size": 500
              }
            }
          }
        }
      }
    }
  }
}
Filtering by regex
Terms can also be filtered by regular expression - only values which match the regular expression are loaded. Note: the regular expression is applied to each term in the field, not to the whole field value. For instance, to only load hashtags from a tweet, we can use a regular expression which matches terms beginning with #:
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "whitespace",
          "fielddata": {
            "filter": {
              "regex": {
                "pattern": "^#.*"
              }
            }
          }
        }
      }
    }
  }
}
These filters can be updated on an existing field mapping and will take effect the next time the fielddata for a segment is loaded. Use the Clear Cache API to reload the fielddata using the new filters.
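The frequency filter's selection logic can be sketched in Python. This is a simplified model, not Elasticsearch's implementation; following the "excluded completely" wording above, it loads nothing from segments smaller than min_segment_size:

```python
def filter_terms(term_doc_counts, docs_with_value,
                 min_freq, max_freq, min_segment_size=0):
    """Keep terms whose per-segment document frequency lies within
    min/max. Values <= 1.0 are percentages of the docs that have a
    value for the field; larger values are absolute counts."""
    if docs_with_value < min_segment_size:
        return set()  # segment too small: excluded completely (sketch)
    lo = min_freq * docs_with_value if min_freq <= 1.0 else min_freq
    hi = max_freq * docs_with_value if max_freq <= 1.0 else max_freq
    return {term for term, df in term_doc_counts.items() if lo <= df <= hi}

counts = {"rare": 1, "common": 50, "ubiquitous": 900}
print(sorted(filter_terms(counts, 1000, 0.001, 0.1)))  # ['common', 'rare']
```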
120.9. format
In JSON documents, dates are represented as strings. Elasticsearch uses a set of preconfigured formats to recognize and parse these strings into a long value representing milliseconds-since-the-epoch in UTC.
Besides the built-in formats, your own
custom formats can be specified using the familiar
yyyy/MM/dd syntax:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"date": {
"type": "date",
"format": "yyyy-MM-dd"
}
}
}
}
}
Many APIs which support date values also support date math
expressions, such as now-1m/d — the current time, minus one month, rounded
down to the nearest day.
|
|
The format setting must have the same setting for fields of the same
name in the same index. Its value can be updated on existing fields using the
PUT mapping API.
|
120.9.1. Custom date formats
Completely customizable date formats are supported. The syntax for these is explained in the Joda docs.
120.9.2. Built In Formats
Most of the formats below have a strict companion format, which means that the
year, month and day parts must have leading zeros in order
to be valid. This means that a date like 5/11/1 would not be valid; instead
you would need to specify the full date, which would be 2005/11/01 in this
example. So instead of date_optional_time you would need to specify
strict_date_optional_time.
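The leading-zero requirement can be illustrated in Python. Note that datetime.strptime alone is lenient about zero padding (it would happily accept 5-11-1), so this sketch adds a regex check to model the strict behaviour:

```python
import re
from datetime import datetime

STRICT_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # yyyy-MM-dd with leading zeros

def parse_strict_date(value):
    """Enforce the leading zeros that strict_date requires before handing
    off to strptime (which alone would accept '5-11-1')."""
    if not STRICT_DATE.match(value):
        raise ValueError("not a strict yyyy-MM-dd date: %r" % value)
    return datetime.strptime(value, "%Y-%m-%d")

print(parse_strict_date("2005-11-01"))  # 2005-11-01 00:00:00
```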
The following list shows all the default ISO formats supported:
- epoch_millis: A formatter for the number of milliseconds since the epoch. Note that this timestamp allows a max length of 13 chars, so only dates between 1653 and 2286 are supported. You should use a different date formatter in that case.
- epoch_second: A formatter for the number of seconds since the epoch. Note that this timestamp allows a max length of 10 chars, so only dates between 1653 and 2286 are supported. You should use a different date formatter in that case.
- date_optional_time or strict_date_optional_time: A generic ISO datetime parser where the date is mandatory and the time is optional. Full details here.
- basic_date: A basic formatter for a full date as four digit year, two digit month of year, and two digit day of month: yyyyMMdd.
- basic_date_time: A basic formatter that combines a basic date and time, separated by a T: yyyyMMdd'T'HHmmss.SSSZ.
- basic_date_time_no_millis: A basic formatter that combines a basic date and time without millis, separated by a T: yyyyMMdd'T'HHmmssZ.
- basic_ordinal_date: A formatter for a full ordinal date, using a four digit year and three digit dayOfYear: yyyyDDD.
- basic_ordinal_date_time: A formatter for a full ordinal date and time, using a four digit year and three digit dayOfYear: yyyyDDD'T'HHmmss.SSSZ.
- basic_ordinal_date_time_no_millis: A formatter for a full ordinal date and time without millis, using a four digit year and three digit dayOfYear: yyyyDDD'T'HHmmssZ.
- basic_time: A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit millis, and time zone offset: HHmmss.SSSZ.
- basic_time_no_millis: A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset: HHmmssZ.
- basic_t_time: A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit millis, and time zone offset prefixed by T: 'T'HHmmss.SSSZ.
- basic_t_time_no_millis: A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset prefixed by T: 'T'HHmmssZ.
- basic_week_date or strict_basic_week_date: A basic formatter for a full date as four digit weekyear, two digit week of weekyear, and one digit day of week: xxxx'W'wwe.
- basic_week_date_time or strict_basic_week_date_time: A basic formatter that combines a basic weekyear date and time, separated by a T: xxxx'W'wwe'T'HHmmss.SSSZ.
- basic_week_date_time_no_millis or strict_basic_week_date_time_no_millis: A basic formatter that combines a basic weekyear date and time without millis, separated by a T: xxxx'W'wwe'T'HHmmssZ.
- date or strict_date: A formatter for a full date as four digit year, two digit month of year, and two digit day of month: yyyy-MM-dd.
- date_hour or strict_date_hour: A formatter that combines a full date and two digit hour of day.
- date_hour_minute or strict_date_hour_minute: A formatter that combines a full date, two digit hour of day, and two digit minute of hour.
- date_hour_minute_second or strict_date_hour_minute_second: A formatter that combines a full date, two digit hour of day, two digit minute of hour, and two digit second of minute.
- date_hour_minute_second_fraction or strict_date_hour_minute_second_fraction: A formatter that combines a full date, two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second: yyyy-MM-dd'T'HH:mm:ss.SSS.
- date_hour_minute_second_millis or strict_date_hour_minute_second_millis: A formatter that combines a full date, two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second: yyyy-MM-dd'T'HH:mm:ss.SSS.
- date_time or strict_date_time: A formatter that combines a full date and time, separated by a T: yyyy-MM-dd'T'HH:mm:ss.SSSZZ.
- date_time_no_millis or strict_date_time_no_millis: A formatter that combines a full date and time without millis, separated by a T: yyyy-MM-dd'T'HH:mm:ssZZ.
- hour or strict_hour: A formatter for a two digit hour of day.
- hour_minute or strict_hour_minute: A formatter for a two digit hour of day and two digit minute of hour.
- hour_minute_second or strict_hour_minute_second: A formatter for a two digit hour of day, two digit minute of hour, and two digit second of minute.
- hour_minute_second_fraction or strict_hour_minute_second_fraction: A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second: HH:mm:ss.SSS.
- hour_minute_second_millis or strict_hour_minute_second_millis: A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second: HH:mm:ss.SSS.
- ordinal_date or strict_ordinal_date: A formatter for a full ordinal date, using a four digit year and three digit dayOfYear: yyyy-DDD.
- ordinal_date_time or strict_ordinal_date_time: A formatter for a full ordinal date and time, using a four digit year and three digit dayOfYear: yyyy-DDD'T'HH:mm:ss.SSSZZ.
- ordinal_date_time_no_millis or strict_ordinal_date_time_no_millis: A formatter for a full ordinal date and time without millis, using a four digit year and three digit dayOfYear: yyyy-DDD'T'HH:mm:ssZZ.
- time or strict_time: A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit fraction of second, and time zone offset: HH:mm:ss.SSSZZ.
- time_no_millis or strict_time_no_millis: A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset: HH:mm:ssZZ.
- t_time or strict_t_time: A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit fraction of second, and time zone offset prefixed by T: 'T'HH:mm:ss.SSSZZ.
- t_time_no_millis or strict_t_time_no_millis: A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset prefixed by T: 'T'HH:mm:ssZZ.
- week_date or strict_week_date: A formatter for a full date as four digit weekyear, two digit week of weekyear, and one digit day of week: xxxx-'W'ww-e.
- week_date_time or strict_week_date_time: A formatter that combines a full weekyear date and time, separated by a T: xxxx-'W'ww-e'T'HH:mm:ss.SSSZZ.
- week_date_time_no_millis or strict_week_date_time_no_millis: A formatter that combines a full weekyear date and time without millis, separated by a T: xxxx-'W'ww-e'T'HH:mm:ssZZ.
- weekyear or strict_weekyear: A formatter for a four digit weekyear.
- weekyear_week or strict_weekyear_week: A formatter for a four digit weekyear and two digit week of weekyear.
- weekyear_week_day or strict_weekyear_week_day: A formatter for a four digit weekyear, two digit week of weekyear, and one digit day of week.
- year or strict_year: A formatter for a four digit year.
- year_month or strict_year_month: A formatter for a four digit year and two digit month of year.
- year_month_day or strict_year_month_day: A formatter for a four digit year, two digit month of year, and two digit day of month.
120.10. geohash
Geohashes are a form of lat/lon encoding which divides the earth up into a grid. Each cell in this grid is represented by a geohash string. Each cell in turn can be further subdivided into smaller cells which are represented by a longer string. So the longer the geohash, the smaller (and thus more accurate) the cell is.
Because geohashes are just strings, they can be stored in an inverted index like any other string, which makes querying them very efficient.
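The encoding itself is compact enough to sketch. The standard geohash algorithm interleaves longitude and latitude bits (longitude first) and packs them five bits at a time into a base32 alphabet; this is a general sketch of that algorithm, not Elasticsearch's internal code:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=12):
    """Encode a lat/lon point as a geohash by repeatedly halving the
    longitude and latitude ranges and recording which half the point
    falls in, five bits per output character."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    result, bits, bit_count, even = [], 0, 0, True
    while len(result) < precision:
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if value >= mid:
            bits |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            result.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(result)

print(geohash_encode(41.12, -71.34, 6))  # drm3bt
```

Note that a longer geohash is simply a refinement of a shorter one, which is why prefix matching corresponds to spatial containment.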
If you enable the geohash option, a geohash sub-field will be indexed
under the .geohash suffix (eg location.geohash). The length of the geohash is controlled by the
geohash_precision parameter.
If the geohash_prefix option is enabled, the geohash
option will be enabled automatically.
For example:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"location": {
"type": "geo_point",
"geohash": true
}
}
}
}
}
PUT my_index/my_type/1
{
"location": {
"lat": 41.12,
"lon": -71.34
}
}
GET my_index/_search?fielddata_fields=location.geohash
{
"query": {
"prefix": {
"location.geohash": "drm3b"
}
}
}
A location.geohash field will be indexed for each geo-point. |
|
The geohash can be retrieved with doc_values. |
|
A prefix query can find all geohashes which start with a particular prefix. |
120.11. geohash_precision
Geohashes are a form of lat/lon encoding which divides the earth up into a grid. Each cell in this grid is represented by a geohash string. Each cell in turn can be further subdivided into smaller cells which are represented by a longer string. So the longer the geohash, the smaller (and thus more accurate) the cell is.
The geohash_precision setting controls the length of the geohash that is
indexed when the geohash option is enabled, and the maximum
geohash length when the geohash_prefix option is enabled.
It accepts:
- a number between 1 and 12 (default), which represents the length of the geohash.
- a distance, eg 1km.
If a distance is specified, it will be translated to the smallest geohash-length that will provide the requested resolution.
For example:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"location": {
"type": "geo_point",
"geohash_prefix": true,
"geohash_precision": 6
}
}
}
}
}
PUT my_index/my_type/1
{
"location": {
"lat": 41.12,
"lon": -71.34
}
}
GET my_index/_search?fielddata_fields=location.geohash
{
"query": {
"term": {
"location.geohash": "drm3bt"
}
}
}
A geohash_precision of 6 equates to geohash cells of approximately 1.26km x 0.6km |
120.12. geohash_prefix
Geohashes are a form of lat/lon encoding which divides the earth up into a grid. Each cell in this grid is represented by a geohash string. Each cell in turn can be further subdivided into smaller cells which are represented by a longer string. So the longer the geohash, the smaller (and thus more accurate) the cell is.
While the geohash option enables indexing the geohash that
corresponds to the lat/lon point, at the specified
precision, the geohash_prefix option also
indexes all the enclosing cells.
For instance, a geohash of drm3btev3e86 will index all of the following
terms: [ d, dr, drm, drm3, drm3b, drm3bt, drm3bte, drm3btev,
drm3btev3, drm3btev3e, drm3btev3e8, drm3btev3e86 ].
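Enumerating the enclosing cells is simply taking every prefix of the geohash string, which can be sketched as:

```python
def geohash_prefixes(geohash):
    """All enclosing cells of a geohash: every prefix, shortest first."""
    return [geohash[:i] for i in range(1, len(geohash) + 1)]
```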
The geohash prefixes can be used with the
geohash_cell query to find points within a
particular geohash, or its neighbours:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"location": {
"type": "geo_point",
"geohash_prefix": true,
"geohash_precision": 6
}
}
}
}
}
PUT my_index/my_type/1
{
"location": {
"lat": 41.12,
"lon": -71.34
}
}
GET my_index/_search?fielddata_fields=location.geohash
{
"query": {
"geohash_cell": {
"location": {
"lat": 41.02,
"lon": -71.48
},
"precision": 4,
"neighbors": true
}
}
}
120.13. ignore_above
Strings longer than the ignore_above setting will not be processed by the
analyzer and will not be indexed. This is mainly useful for
not_analyzed string fields, which are typically used for
filtering, aggregations, and sorting. These are structured fields and it
doesn’t usually make sense to allow very long terms to be indexed in these
fields.
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"message": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 20
}
}
}
}
}
PUT my_index/my_type/1
{
"message": "Syntax error"
}
PUT my_index/my_type/2
{
"message": "Syntax error with some long stacktrace"
}
GET _search
{
"aggs": {
"messages": {
"terms": {
"field": "message"
}
}
}
}
- This field will ignore any string longer than 20 characters.
- This document is indexed successfully.
- This document will be indexed, but without indexing the message field.
- Search returns both documents, but only the first is present in the terms aggregation.
The ignore_above setting can be set differently for fields of the same name in the same index. Its value can be updated on existing fields using the PUT mapping API.
This option is also useful for protecting against Lucene’s term byte-length limit of 32766.
The value for ignore_above is the character count, but Lucene counts bytes. If you use UTF-8 text with many non-ASCII characters, you may want to set the limit to 32766 / 3 = 10922 since UTF-8 characters may occupy at most 3 bytes.
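The character-versus-byte distinction can be checked directly. This sketch shows why 10922 is a safe character limit against the 32766-byte term limit when characters may take up to 3 bytes:

```python
def ignored(value, ignore_above):
    """ignore_above compares the *character* count of the value."""
    return len(value) > ignore_above

def fits_lucene_term_limit(value, limit_bytes=32766):
    """Lucene's term limit, by contrast, is measured in UTF-8 *bytes*."""
    return len(value.encode("utf-8")) <= limit_bytes
```

A string of 10922 three-byte characters encodes to exactly 32766 bytes, right at the limit.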
120.14. ignore_malformed
Sometimes you don’t have much control over the data that you receive. One
user may send a login field that is a date, and another sends a
login field that is an email address.
Trying to index the wrong datatype into a field throws an exception by
default, and rejects the whole document. The ignore_malformed parameter, if
set to true, allows the exception to be ignored. The malformed field is not
indexed, but other fields in the document are processed normally.
For example:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"number_one": {
"type": "integer"
},
"number_two": {
"type": "integer",
"ignore_malformed": true
}
}
}
}
}
PUT my_index/my_type/1
{
"text": "Some text value",
"number_one": "foo"
}
PUT my_index/my_type/2
{
"text": "Some text value",
"number_two": "foo"
}
- This document will be rejected because number_one does not allow malformed values.
- This document will have the text field indexed, but not the number_two field.
The ignore_malformed setting can be set differently for fields of the same name in the same index. Its value can be updated on existing fields using the PUT mapping API.
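The behaviour can be sketched in a few lines of Python. The mapping structure and error message here are made up for illustration; they are not Elasticsearch's own:

```python
def index_document(doc, mapping):
    """Per-field ignore_malformed: coerce each mapped field; on failure,
    drop just that field (ignore_malformed: true) or reject the whole document."""
    indexed = {}
    for field, value in doc.items():
        field_type, ignore_malformed = mapping.get(field, (str, True))
        try:
            indexed[field] = field_type(value)
        except (TypeError, ValueError):
            if not ignore_malformed:
                raise ValueError("failed to parse [%s]" % field)
            # Malformed field skipped; the rest of the document is still indexed.
    return indexed

# Hypothetical mapping mirroring the example above.
mapping = {"number_one": (int, False), "number_two": (int, True)}
```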
120.14.1. Index-level default
The index.mapping.ignore_malformed setting can be set at the index level to
ignore malformed content globally across all mapping types.
PUT my_index
{
"settings": {
"index.mapping.ignore_malformed": true
},
"mappings": {
"my_type": {
"properties": {
"number_one": {
"type": "byte"
},
"number_two": {
"type": "integer",
"ignore_malformed": false
}
}
}
}
}
- The number_one field inherits the index-level setting.
- The number_two field overrides the index-level setting to turn off ignore_malformed.
120.15. include_in_all
The include_in_all parameter provides per-field control over which fields
are included in the _all field. It defaults to true, unless index is set to no.
This example demonstrates how to exclude the date field from the _all field:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"title": {
"type": "string"
},
"content": {
"type": "string"
},
"date": {
"type": "date",
"include_in_all": false
}
}
}
}
}
- The title and content fields will be included in the _all field.
- The date field will not be included in the _all field.
The include_in_all setting can be set differently for fields of the same name in the same index. Its value can be updated on existing fields using the PUT mapping API.
The include_in_all parameter can also be set at the type level and on
object or nested fields, in which case all sub-
fields inherit that setting. For instance:
PUT my_index
{
"mappings": {
"my_type": {
"include_in_all": false,
"properties": {
"title": { "type": "string" },
"author": {
"include_in_all": true,
"properties": {
"first_name": { "type": "string" },
"last_name": { "type": "string" }
}
},
"editor": {
"properties": {
"first_name": { "type": "string" },
"last_name": { "type": "string", "include_in_all": true }
}
}
}
}
}
}
- All fields in my_type are excluded from _all.
- The author.first_name and author.last_name fields are included in _all.
- Only the editor.last_name field is included in _all. The editor.first_name field inherits the type-level setting and is excluded.
Note that for multi-fields, it is the original field value that is added to the _all field, not the terms produced by each sub-field’s analyzer.
120.16. index
The index option controls how field values are indexed and, thus, how they
are searchable. It accepts three values:
no
Do not add this field value to the index. With this setting, the field will not be queryable.
not_analyzed
Add the field value to the index unchanged, as a single term. This is the default for all fields that support this option except for string fields.
analyzed
This option applies only to string fields, for which it is the default. The string field value is first analyzed to convert it into terms (e.g. a list of individual words), which are then indexed.
For example, you can create a not_analyzed string field with the following:
PUT /my_index
{
"mappings": {
"my_type": {
"properties": {
"status_code": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
120.17. index_options
The index_options parameter controls what information is added to the
inverted index, for search and highlighting purposes. It accepts the
following settings:
docs
Only the doc number is indexed. Can answer the question: Does this term exist in this field?
freqs
Doc number and term frequencies are indexed. Term frequencies are used to score repeated terms higher than single terms.
positions
Doc number, term frequencies, and term positions (or order) are indexed. Positions can be used for proximity or phrase queries.
offsets
Doc number, term frequencies, positions, and start and end character offsets (which map the term back to the original string) are indexed. Offsets are used by the postings highlighter.
Analyzed string fields use positions as the default, and all other fields use docs as the default.
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "string",
"index_options": "offsets"
}
}
}
}
}
PUT my_index/my_type/1
{
"text": "Quick brown fox"
}
GET my_index/_search
{
"query": {
"match": {
"text": "brown fox"
}
},
"highlight": {
"fields": {
"text": {}
}
}
}
The text field will use the postings highlighter by default because offsets are indexed.
120.18. lat_lon
Geo-queries are usually performed by plugging the value of
each geo_point field into a formula to determine whether it
falls into the required area or not. Unlike most queries, the inverted index
is not involved.
Setting lat_lon to true causes the latitude and longitude values to be
indexed as numeric fields (called .lat and .lon). These fields can be used
by the geo_bounding_box and
geo_distance queries instead of
performing in-memory calculations.
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"location": {
"type": "geo_point",
"lat_lon": true
}
}
}
}
}
PUT my_index/my_type/1
{
"location": {
"lat": 41.12,
"lon": -71.34
}
}
GET my_index/_search
{
"query": {
"geo_distance": {
"location": {
"lat": 41,
"lon": -71
},
"distance": "50km",
"optimize_bbox": "indexed"
}
}
}
- Setting lat_lon to true indexes the geo-point in the location.lat and location.lon fields.
- The indexed option tells the geo-distance query to use the inverted index instead of the in-memory calculation. Whether the in-memory or indexed operation performs better depends both on your dataset and on the types of queries that you are running.
The lat_lon option only makes sense for single-value geo_point fields. It will not work with arrays of geo-points.
120.19. fields
It is often useful to index the same field in different ways for different
purposes. This is the purpose of multi-fields. For instance, a string
field could be indexed as an analyzed field for full-text
search, and as a not_analyzed field for sorting or aggregations:
PUT /my_index
{
"mappings": {
"my_type": {
"properties": {
"city": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
PUT /my_index/my_type/1
{
"city": "New York"
}
PUT /my_index/my_type/2
{
"city": "York"
}
GET /my_index/_search
{
"query": {
"match": {
"city": "york"
}
},
"sort": {
"city.raw": "asc"
},
"aggs": {
"Cities": {
"terms": {
"field": "city.raw"
}
}
}
}
- The city.raw field is a not_analyzed version of the city field.
- The analyzed city field can be used for full text search.
- The city.raw field can be used for sorting and aggregations.
Multi-fields do not change the original _source field.
The fields setting can be set differently for fields of the same name in the same index. New multi-fields can be added to existing fields using the PUT mapping API.
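The difference between the two sub-fields can be sketched with a toy analyzer (a rough stand-in for the standard analyzer, for illustration only):

```python
def analyze(text):
    """Toy stand-in for the standard analyzer: lowercase, split on whitespace."""
    return text.lower().split()

docs = ["New York", "York"]

# Full-text search uses the analyzed terms of `city`, so "york" matches both...
matches = [d for d in docs if "york" in analyze(d)]

# ...while sorting and aggregations use the single, unmodified `city.raw` term.
sorted_cities = sorted(docs)
```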
120.19.1. Multi-fields with multiple analyzers
Another use case of multi-fields is to analyze the same field in different
ways for better relevance. For instance we could index a field with the
standard analyzer which breaks text up into
words, and again with the english analyzer
which stems words into their root form:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "string",
"fields": {
"english": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
}
}
PUT my_index/my_type/1
{ "text": "quick brown fox" }
PUT my_index/my_type/2
{ "text": "quick brown foxes" }
GET my_index/_search
{
"query": {
"multi_match": {
"query": "quick brown foxes",
"fields": [
"text",
"text.english"
],
"type": "most_fields"
}
}
}
- The text field uses the standard analyzer.
- The text.english field uses the english analyzer.
- Index two documents, one with fox and the other with foxes.
- Query both the text and text.english fields and combine the scores.
The text field contains the term fox in the first document and foxes in
the second document. The text.english field contains fox for both
documents, because foxes is stemmed to fox.
The query string is also analyzed by the standard analyzer for the text
field, and by the english analyzer for the text.english field. The
stemmed field allows a query for foxes to also match the document containing
just fox. This allows us to match as many documents as possible. By also
querying the unstemmed text field, we improve the relevance score of the
document which matches foxes exactly.
120.20. norms
Norms store various normalization factors (a number to represent the relative field length and the index-time boost setting) that are later used at query time to compute the score of a document relative to a query.
Although useful for scoring, norms also require quite a lot of memory (typically in the order of one byte per document per field in your index, even for documents that don’t have this specific field). As a consequence, if you don’t need scoring on a specific field, you should disable norms on that field. In particular, this is the case for fields that are used solely for filtering or aggregations.
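As a rough back-of-the-envelope estimate, assuming the one-byte-per-document-per-field figure above:

```python
def norms_memory_bytes(num_docs, fields_with_norms):
    """Approximate heap used by norms: about one byte per document per field,
    paid even by documents that do not contain the field."""
    return num_docs * fields_with_norms

# e.g. 10 million documents with norms enabled on 5 string fields: ~50 MB.
estimate = norms_memory_bytes(10_000_000, 5)
```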
The norms.enabled setting must be the same for fields of the same name in the same index. Norms can be disabled on existing fields using the PUT mapping API.
Norms can be disabled (but not re-enabled) after the fact, using the PUT mapping API like so:
PUT my_index/_mapping/my_type
{
"properties": {
"title": {
"type": "string",
"norms": {
"enabled": false
}
}
}
}
Norms will not be removed instantly, but will be removed as old segments are merged into new segments as you continue indexing new documents. Any score computation on a field that has had norms removed might return inconsistent results, since some documents won’t have norms anymore while other documents might still have norms.
120.20.1. Lazy loading of norms
Norms can be loaded into memory eagerly (eager), whenever a new segment
comes online, or they can be loaded lazily (lazy, the default), only when the field
is queried.
Eager loading can be configured as follows:
PUT my_index/_mapping/my_type
{
"properties": {
"title": {
"type": "string",
"norms": {
"loading": "eager"
}
}
}
}
The norms.loading setting must be the same for fields of the same name in the same index. Its value can be updated on existing fields using the PUT mapping API.
120.21. null_value
A null value cannot be indexed or searched. When a field is set to null,
(or an empty array or an array of null values) it is treated as though that
field has no values.
The null_value parameter allows you to replace explicit null values with
the specified value so that it can be indexed and searched. For instance:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"status_code": {
"type": "string",
"index": "not_analyzed",
"null_value": "NULL"
}
}
}
}
}
PUT my_index/my_type/1
{
"status_code": null
}
PUT my_index/my_type/2
{
"status_code": []
}
GET my_index/_search
{
"query": {
"term": {
"status_code": "NULL"
}
}
}
- Replace explicit null values with the term NULL.
- An empty array does not contain an explicit null, and so won’t be replaced with the null_value.
- A query for NULL returns document 1, but not document 2.
The null_value needs to be the same datatype as the field. For instance, a long field cannot have a string null_value. String fields which are analyzed will also pass the null_value through the configured analyzer.
Also see the missing query for its null_value support.
120.22. position_increment_gap
Analyzed string fields take term positions
into account, in order to be able to support
proximity or phrase queries.
When indexing string fields with multiple values a "fake" gap is added between
the values to prevent most phrase queries from matching across the values. The
size of this gap is configured using position_increment_gap and defaults to
100.
For example:
PUT /my_index/groups/1
{
"names": [ "John Abraham", "Lincoln Smith"]
}
GET /my_index/groups/_search
{
"query": {
"match_phrase": {
"names": "Abraham Lincoln"
}
}
}
GET /my_index/groups/_search
{
"query": {
"match_phrase": {
"names": "Abraham Lincoln",
"slop": 101
}
}
}
- This phrase query doesn’t match our document, which is totally expected.
- This phrase query matches our document, even though Abraham and Lincoln are in separate strings, because slop > position_increment_gap.
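The positions involved can be sketched in Python. The exact convention for where the gap is applied is an assumption here (the first term of each subsequent value is placed gap + 1 positions after the last term of the previous value), but it reproduces why a slop of 101 is needed above:

```python
def term_positions(values, gap=100):
    """Assign term positions to a multi-valued string field (a sketch)."""
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # the "fake" gap between array values
        for term in value.lower().split():
            pos += 1
            positions[term] = pos
    return positions
```

With the default gap of 100, abraham sits at position 1 and lincoln at 102, so the phrase query needs a slop of at least 101 to match; with a gap of 0 the terms are adjacent.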
The position_increment_gap can be specified in the mapping. For instance:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"names": {
"type": "string",
"position_increment_gap": 0
}
}
}
}
}
PUT /my_index/groups/1
{
"names": [ "John Abraham", "Lincoln Smith"]
}
GET /my_index/groups/_search
{
"query": {
"match_phrase": {
"names": "Abraham Lincoln"
}
}
}
- The first term in the next array element will be 0 terms apart from the last term in the previous array element.
- The phrase query matches our document, which is weird, but it’s what we asked for in the mapping.
The position_increment_gap setting can be set differently for fields of the same name in the same index. Its value can be updated on existing fields using the PUT mapping API.
120.23. precision_step
Most numeric datatypes index extra terms representing numeric
ranges for each number to make range queries
faster. For instance, this range query:
"range": {
"number": {
"gte": 0,
"lte": 321
}
}
might be executed internally as a terms query that
looks something like this:
"terms": {
"number": [
"0-255",
"256-319",
"320",
"321"
]
}
These extra terms greatly reduce the number of terms that have to be examined, at the cost of increased disk space.
The default value for precision_step depends on the type of the numeric field:
long, double, date, ip
16
integer, float, short
8
byte
2147483647 (i.e. disabled)
token_count
32
The value of the precision_step setting indicates the number of bits that
should be compressed into an extra term. A long value consists of 64 bits,
so a precision_step of 16 results in the following terms:
Term 1: bits 0-15
Term 2: bits 0-31
Term 3: bits 0-47
Term 4: bits 0-63
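The idea behind these extra terms can be sketched by dropping the low bits of the value at each step. This mirrors the concept only, not Lucene's actual on-disk encoding:

```python
def range_terms(value, precision_step=16, bits=64):
    """Prefix terms for a numeric value: the value with its lowest `shift`
    bits dropped, for shift = 0, step, 2*step, ... (conceptual sketch)."""
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]
```

A long with precision_step 16 produces four terms, one per 16-bit step; a range query can then match a whole block of values with a single coarse term instead of enumerating every value in the range.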
120.24. properties
Type mappings, object fields and nested fields
contain sub-fields, called properties. These properties may be of any
datatype, including object and nested. Properties can
be added:
-
explicitly by defining them when creating an index.
-
explicitly by defining them when adding or updating a mapping type with the PUT mapping API.
-
dynamically just by indexing documents containing new fields.
Below is an example of adding properties to a mapping type, an object
field, and a nested field:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"manager": {
"properties": {
"age": { "type": "integer" },
"name": { "type": "string" }
}
},
"employees": {
"type": "nested",
"properties": {
"age": { "type": "integer" },
"name": { "type": "string" }
}
}
}
}
}
}
PUT my_index/my_type/1
{
"region": "US",
"manager": {
"name": "Alice White",
"age": 30
},
"employees": [
{
"name": "John Smith",
"age": 34
},
{
"name": "Peter Brown",
"age": 26
}
]
}
- Properties under the my_type mapping type.
- Properties under the manager object field.
- Properties under the employees nested field.
- An example document which corresponds to the above mapping.
The properties setting can be set differently for fields of the same name in the same index. New properties can be added to existing fields using the PUT mapping API.
120.24.1. Dot notation
Inner fields can be referred to in queries, aggregations, etc., using dot notation:
GET my_index/_search
{
"query": {
"match": {
"manager.name": "Alice White"
}
},
"aggs": {
"Employees": {
"nested": {
"path": "employees"
},
"aggs": {
"Employee Ages": {
"histogram": {
"field": "employees.age",
"interval": 5
}
}
}
}
}
}
The full path to the inner field must be specified.
120.25. search_analyzer
Usually, the same analyzer should be applied at index time and at search time, to ensure that the terms in the query are in the same format as the terms in the inverted index.
Sometimes, though, it can make sense to use a different analyzer at search
time, such as when using the edge_ngram
tokenizer for autocomplete.
By default, queries will use the analyzer defined in the field mapping, but
this can be overridden with the search_analyzer setting:
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "string",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
PUT my_index/my_type/1
{
"text": "Quick Brown Fox"
}
GET my_index/_search
{
"query": {
"match": {
"text": {
"query": "Quick Br",
"operator": "and"
}
}
}
}
- Analysis settings to define the custom autocomplete analyzer.
- The text field uses the autocomplete analyzer at index time, but the standard analyzer at search time.
- This field is indexed as the terms: [ q, qu, qui, quic, quick, b, br, bro, brow, brown, f, fo, fox ]
- The query searches for both of these terms: [ quick, br ]
See Index time search-as-you-type for a full explanation of this example.
The search_analyzer setting must be the same for fields of the same name in the same index. Its value can be updated on existing fields using the PUT mapping API.
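The index-time versus search-time split can be sketched with a toy edge n-gram implementation (a simplification of the real analyzers, for illustration only):

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Edge n-grams of a single token, e.g. quick -> q, qu, qui, quic, quick."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def analyze_index(text):
    """Index-time analysis: lowercase, split, then edge n-gram each token."""
    return [g for token in text.lower().split() for g in edge_ngrams(token)]

def analyze_search(text):
    """Search-time analysis is standard-analyzer-like: no n-gramming."""
    return text.lower().split()
```

Every search-time term for "Quick Br" is present among the indexed terms of "Quick Brown Fox", so the and-operator match succeeds.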
120.26. similarity
Elasticsearch allows you to configure a scoring algorithm or similarity per
field. The similarity setting provides a simple way of choosing a similarity
algorithm other than the default TF/IDF, such as BM25.
Similarities are mostly useful for string fields, especially
analyzed string fields, but can also apply to other field types.
Custom similarities can be configured by tuning the parameters of the built-in similarities. For more details about these expert options, see the similarity module.
The only similarities which can be used out of the box, without any further configuration, are:
default
The default TF/IDF algorithm used by Elasticsearch and Lucene. See Lucene’s Practical Scoring Function for more information.
BM25
The Okapi BM25 algorithm. See Pluggable Similarity Algorithms for more information.
The similarity can be set on the field level when a field is first created,
as follows:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"default_field": {
"type": "string"
},
"bm25_field": {
"type": "string",
"similarity": "BM25"
}
}
}
}
}
- The default_field uses the default similarity (i.e. TF/IDF).
- The bm25_field uses the BM25 similarity.
120.27. store
By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved.
Usually this doesn’t matter. The field value is already part of the
_source field, which is stored by default. If you
only want to retrieve the value of a single field or of a few fields, instead
of the whole _source, then this can be achieved with
source filtering.
In certain situations it can make sense to store a field. For instance, if
you have a document with a title, a date, and a very large content
field, you may want to retrieve just the title and the date without having
to extract those fields from a large _source field:
PUT /my_index
{
"mappings": {
"my_type": {
"properties": {
"title": {
"type": "string",
"store": true
},
"date": {
"type": "date",
"store": true
},
"content": {
"type": "string"
}
}
}
}
}
PUT /my_index/my_type/1
{
"title": "Some short title",
"date": "2015-01-01",
"content": "A very long content field..."
}
GET my_index/_search
{
"fields": [ "title", "date" ]
}
- The title and date fields are stored.
- This request will retrieve the values of the title and date fields.
Stored fields returned as arrays: for consistency, stored fields are always returned as an array because there is no way of knowing if the original field value was a single value, multiple values, or an empty array. If you need the original value, you should retrieve it from the _source field instead.
Another situation where it can make sense to make a field stored is for those
that don’t appear in the _source field (such as copy_to fields).
120.28. term_vector
Term vectors contain information about the terms produced by the analysis process, including:
-
a list of terms.
-
the position (or order) of each term.
-
the start and end character offsets mapping the term to its origin in the original string.
These term vectors can be stored so that they can be retrieved for a particular document.
The term_vector setting accepts:
no
No term vectors are stored. (default)
yes
Just the terms in the field are stored.
with_positions
Terms and positions are stored.
with_offsets
Terms and character offsets are stored.
with_positions_offsets
Terms, positions, and character offsets are stored.
The fast vector highlighter requires with_positions_offsets. The term vectors API can retrieve whatever is stored.
Setting with_positions_offsets will double the size of a field’s index.
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "string",
"term_vector": "with_positions_offsets"
}
}
}
}
}
PUT my_index/my_type/1
{
"text": "Quick brown fox"
}
GET my_index/_search
{
"query": {
"match": {
"text": "brown fox"
}
},
"highlight": {
"fields": {
"text": {}
}
}
}
The fast vector highlighter will be used by default for the text field because term vectors are enabled.
121. Dynamic Mapping
One of the most important features of Elasticsearch is that it tries to get out of your way and let you start exploring your data as quickly as possible. To index a document, you don’t have to first create an index, define a mapping type, and define your fields — you can just index a document and the index, type, and fields will spring to life automatically:
PUT data/counters/1
{ "count": 5 }
Creates the data index, the counters mapping type, and a field called count with datatype long.
The automatic detection and addition of new types and fields is called dynamic mapping. The dynamic mapping rules can be customised to suit your purposes with:
- The _default_ mapping: configure the base mapping to be used for new mapping types.
- Dynamic field mappings: the rules governing dynamic field detection.
- Dynamic templates: custom rules to configure the mapping for dynamically added fields.
Index templates allow you to configure the default mappings, settings, aliases, and warmers for new indices, whether created automatically or explicitly.
Disabling automatic type creation
Automatic type creation can be disabled by setting the index.mapper.dynamic
setting to false, either by setting the default value in the
config/elasticsearch.yml file, or per-index as an index setting:
PUT /_settings
{
"index.mapper.dynamic":false
}
Disable automatic type creation for all indices.
Regardless of the value of this setting, types can still be added explicitly when creating an index or with the PUT mapping API.
121.1. _default_ mapping
The default mapping, which will be used as the base mapping for any new
mapping types, can be customised by adding a mapping type with the name
_default_ to an index, either when
creating the index or later on with the
PUT mapping API.
PUT my_index
{
"mappings": {
"_default_": {
"_all": {
"enabled": false
}
},
"user": {},
"blogpost": {
"_all": {
"enabled": true
}
}
}
}
- The _default_ mapping disables the _all field by default.
- The user type inherits the settings from _default_.
- The blogpost type overrides the defaults and enables the _all field.
While the _default_ mapping can be updated after an index has been created,
the new defaults will only affect mapping types that are created afterwards.
The _default_ mapping can be used in conjunction with
Index templates to control dynamically created types
within automatically created indices:
PUT _template/logging
{
"template": "logs-*",
"settings": { "number_of_shards": 1 },
"mappings": {
"_default_": {
"_all": {
"enabled": false
},
"dynamic_templates": [
{
"strings": {
"match_mapping_type": "string",
"mapping": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 256
}
}
}
}
}
]
}
}
}
PUT logs-2015.10.01/event/1
{ "message": "error:16" }
- The logging template will match any indices beginning with logs-.
- Matching indices will be created with a single primary shard.
- The _all field will be disabled by default for new type mappings.
- String fields will be created with an analyzed main field, and a not_analyzed .raw field.
121.2. Dynamic field mapping
By default, when a previously unseen field is found in a document,
Elasticsearch will add the new field to the type mapping. This behaviour can
be disabled, both at the document and at the object level, by
setting the dynamic parameter to false or to strict.
Assuming dynamic field mapping is enabled, some simple rules are used to
determine which datatype the field should have:
JSON datatype / Elasticsearch datatype:
null
No field is added.
true or false
boolean field
floating point number
double field
integer
long field
object
object field
array
Depends on the first non-null value in the array.
string
Either a date field (if the value passes date detection), a double or long field (if the value passes numeric detection), or an analyzed string field.
These are the only field datatypes that are dynamically detected. All other datatypes must be mapped explicitly.
Besides the options listed below, dynamic field mapping rules can be further
customised with dynamic_templates.
121.2.1. Date detection
If date_detection is enabled (default), then new string fields are checked
to see whether their contents match any of the date patterns specified in
dynamic_date_formats. If a match is found, a new date field is
added with the corresponding format.
The default value for dynamic_date_formats is:
[ "strict_date_optional_time","yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]
For example:
PUT my_index/my_type/1
{
"create_date": "2015/09/02"
}
GET my_index/_mapping
The create_date field has been added as a date field with the format "yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z".
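Date detection can be sketched with strptime, translating the Joda-style patterns to rough Python equivalents (yyyy/MM/dd becomes %Y/%m/%d; the optional timezone part of the default formats is omitted in this sketch):

```python
from datetime import datetime

# Rough strptime equivalents of the default dynamic_date_formats.
FORMATS = ["%Y/%m/%d %H:%M:%S", "%Y/%m/%d"]

def looks_like_date(value):
    """True if a new string field would be detected as a date."""
    for fmt in FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False
```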
Disabling date detection
Dynamic date detection can be disabled by setting date_detection to false:
PUT my_index
{
"mappings": {
"my_type": {
"date_detection": false
}
}
}
PUT my_index/my_type/1
{
"create_date": "2015/09/02"
}
The create_date field has been added as a string field.
Customising detected date formats
Alternatively, the dynamic_date_formats can be customised to support your
own date formats:
PUT my_index
{
"mappings": {
"my_type": {
"dynamic_date_formats": ["MM/dd/yyyy"]
}
}
}
PUT my_index/my_type/1
{
"create_date": "09/25/2015"
}
121.2.2. Numeric detection
While JSON has support for native floating point and integer datatypes, some applications or languages may sometimes render numbers as strings. Usually the correct solution is to map these fields explicitly, but numeric detection (which is disabled by default) can be enabled to do this automatically:
PUT my_index
{
"mappings": {
"my_type": {
"numeric_detection": true
}
}
}
PUT my_index/my_type/1
{
"my_float": "1.0",
"my_integer": "1"
}
121.3. Dynamic templates
Dynamic templates allow you to define custom mappings that can be applied to dynamically added fields based on:
-
the datatype detected by Elasticsearch, with
match_mapping_type. -
the name of the field, with
matchandunmatchormatch_pattern. -
the full dotted path to the field, with
path_matchandpath_unmatch.
The original field name {name} and the detected datatype
{dynamic_type} template variables can be used in
the mapping specification as placeholders.
Dynamic field mappings are only added when a field contains a concrete value, not null or an empty array. This means that if the null_value option is used in a dynamic_template, it will only be applied after the first document with a concrete value for the field has been indexed.
Dynamic templates are specified as an array of named objects:
"dynamic_templates": [
{
"my_template_name": {
... match conditions ...
"mapping": { ... }
}
},
...
]
- The template name can be any string value.
- The match conditions can include any of: match_mapping_type, match, match_pattern, unmatch, path_match, path_unmatch.
- The mapping that the matched field should use.
Templates are processed in order — the first matching template wins. New templates can be appended to the end of the list with the PUT mapping API. If a new template has the same name as an existing template, it will replace the old version.
121.3.1. match_mapping_type
The match_mapping_type matches on the datatype detected by
dynamic field mapping, in other words, the datatype
that Elasticsearch thinks the field should have. Only the following datatypes
can be automatically detected: boolean, date, double, long, object,
string. It also accepts * to match all datatypes.
For example, if we wanted to map all integer fields as integer instead of
long, and all string fields as both analyzed and not_analyzed, we
could use the following template:
PUT my_index
{
"mappings": {
"my_type": {
"dynamic_templates": [
{
"integers": {
"match_mapping_type": "long",
"mapping": {
"type": "integer"
}
}
},
{
"strings": {
"match_mapping_type": "string",
"mapping": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 256
}
}
}
}
}
]
}
}
}
PUT my_index/my_type/1
{
"my_integer": 5,
"my_string": "Some string"
}
The my_integer field is mapped as an integer, and the my_string field
is mapped as an analyzed string with a not_analyzed multi field.
121.3.2. match and unmatch
The match parameter uses a pattern to match on the fieldname, while
unmatch uses a pattern to exclude fields matched by match.
The following example matches all string fields whose name starts with
long_ (except for those which end with _text) and maps them as long
fields:
PUT my_index
{
"mappings": {
"my_type": {
"dynamic_templates": [
{
"longs_as_strings": {
"match_mapping_type": "string",
"match": "long_*",
"unmatch": "*_text",
"mapping": {
"type": "long"
}
}
}
]
}
}
}
PUT my_index/my_type/1
{
"long_num": "5",
"long_text": "foo"
}
The long_num field is mapped as a long, while the long_text field
uses the default string mapping.
121.3.3. match_pattern
The match_pattern parameter adjusts the behavior of the match parameter
such that it supports full Java regular expression matching on the field name
instead of simple wildcards, for instance:
"match_pattern": "regex",
"match": "^profit_\d+$"
121.3.4. path_match and path_unmatch
The path_match and path_unmatch parameters work in the same way as match
and unmatch, but operate on the full dotted path to the field, not just the
final name, e.g. some_object.*.some_field.
This example copies the values of any fields in the name object to the
top-level full_name field, except for the middle field:
PUT my_index
{
"mappings": {
"my_type": {
"dynamic_templates": [
{
"full_name": {
"path_match": "name.*",
"path_unmatch": "*.middle",
"mapping": {
"type": "string",
"copy_to": "full_name"
}
}
}
]
}
}
}
PUT my_index/my_type/1
{
"name": {
"first": "Alice",
"middle": "Mary",
"last": "White"
}
}
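With this mapping, a query against the copied full_name field should match the first and last names but not the middle name. The following query is a sketch:
GET my_index/_search
{
  "query": {
    "match": {
      "full_name": "Alice White"
    }
  }
}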
121.3.5. {name} and {dynamic_type}
The {name} and {dynamic_type} placeholders are replaced in the mapping
with the field name and detected dynamic type. The following example sets all
string fields to use an analyzer with the same name as the
field, and disables doc_values for all non-string fields:
PUT my_index
{
"mappings": {
"my_type": {
"dynamic_templates": [
{
"named_analyzers": {
"match_mapping_type": "string",
"match": "*",
"mapping": {
"type": "string",
"analyzer": "{name}"
}
}
},
{
"no_doc_values": {
"match_mapping_type":"*",
"mapping": {
"type": "{dynamic_type}",
"doc_values": false
}
}
}
]
}
}
}
PUT my_index/my_type/1
{
"english": "Some English text",
"count": 5
}
The english field is mapped as a string field with the english
analyzer, and the count field is mapped as a long field with
doc_values disabled.
121.4. Override default template
You can override the default mappings for all indices and all types
by specifying a _default_ type mapping in an index template
which matches all indices.
For example, to disable the _all field by default for all types in all
new indices, you could create the following index template:
PUT _template/disable_all_field
{
"disable_all_field": {
"order": 0,
"template": "*",
"mappings": {
"_default_": {
"_all": {
"enabled": false
}
}
}
}
}
The template pattern * matches all new indices, so the mappings are
applied to every index that is created. The _default_ type mapping
provides defaults for all mapping types within the index, and here
disables the _all field by default.
122. Transform
Deprecated in 2.0.0.
The document can be transformed before it is indexed by registering a script in
the transform element of the mapping. The result of the transform is indexed
but the original source is stored in the _source field.
This was deprecated in 2.0.0 because it made debugging very difficult. As of now there really isn’t a feature to use in its place other than transforming the document in the client application.
Deprecated or no, here is an example:
{
"example" : {
"transform" : {
"script" : {
"inline": "if (ctx._source['title']?.startsWith('t')) ctx._source['suggest'] = ctx._source['content']",
"params" : {
"variable" : "not used but an example anyway"
},
"lang": "groovy"
}
},
"properties": {
"title": { "type": "string" },
"content": { "type": "string" },
"suggest": { "type": "string" }
}
}
}
It is also possible to specify multiple transforms:
{
"example" : {
"transform" : [
{"script": "ctx._source['suggest'] = ctx._source['content']"},
{"script": "ctx._source['foo'] = ctx._source['bar'];"}
]
}
}
Because the result isn’t stored in the source, it can’t normally be fetched by source filtering. It can be highlighted if it is marked as stored.
122.1. Get Transformed
The get endpoint will retransform the source if the _source_transform
parameter is set. Example:
curl -XGET "http://localhost:9200/test/example/3?pretty&_source_transform"
The transform is performed before any source filtering but it is mostly designed to make it easy to see what was passed to the index for debugging.
Analysis
The index analysis module acts as a configurable registry of Analyzers
that can be used both to break indexed (analyzed) fields into tokens
when a document is indexed and to process query strings. It maps to the
Lucene Analyzer.
Analyzers are composed of a single Tokenizer
and zero or more TokenFilters. The tokenizer may
be preceded by one or more CharFilters. The
analysis module allows one to register TokenFilters, Tokenizers and
Analyzers under logical names that can then be referenced either in
mapping definitions or in certain APIs. The Analysis module
automatically registers (if not explicitly defined) built in
analyzers, token filters, and tokenizers.
Here is a sample configuration:
index :
analysis :
analyzer :
standard :
type : standard
stopwords : [stop1, stop2]
myAnalyzer1 :
type : standard
stopwords : [stop1, stop2, stop3]
max_token_length : 500
# configure a custom analyzer which is
# exactly like the default standard analyzer
myAnalyzer2 :
tokenizer : standard
filter : [standard, lowercase, stop]
tokenizer :
myTokenizer1 :
type : standard
max_token_length : 900
myTokenizer2 :
type : keyword
buffer_size : 512
filter :
myTokenFilter1 :
type : stop
stopwords : [stop1, stop2, stop3, stop4]
myTokenFilter2 :
type : length
min : 0
max : 2000
Backwards compatibility
All analyzers, tokenizers, and token filters can be configured with a
version parameter to control which Lucene version behavior they should
use. Possible values are: 3.0 - 3.6, 4.0 - 4.3 (the highest
version number is the default option).
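As a sketch, the version parameter sits alongside the other settings of an analysis component; the analyzer name my_versioned_analyzer is hypothetical:
index :
  analysis :
    analyzer :
      my_versioned_analyzer :
        type : standard
        version : 4.3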
123. Analyzers
Analyzers are composed of a single Tokenizer
and zero or more TokenFilters. The tokenizer may
be preceded by one or more CharFilters.
The analysis module allows you to register Analyzers under logical
names which can then be referenced either in mapping definitions or in
certain APIs.
Elasticsearch comes with a number of prebuilt analyzers which are ready to use. Alternatively, you can combine the built in character filters, tokenizers and token filters to create custom analyzers.
Default Analyzers
An analyzer is registered under a logical name and can then be referenced from mapping definitions or certain APIs. When no analyzer is defined, built-in defaults are used. You can also configure which analyzers will be used by default when none can be derived.
The default logical name allows one to configure an analyzer that will
be used both for indexing and for searching APIs. The default_search
logical name can be used to configure a default analyzer that will be
used just when searching (the default analyzer would still be used for
indexing).
For instance, the following settings could be used to perform exact matching only by default:
index :
analysis :
analyzer :
default :
tokenizer : keyword
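Similarly, a default_search analyzer could be added so that searching uses a different analyzer than indexing; this configuration is a sketch:
index :
  analysis :
    analyzer :
      default :
        tokenizer : keyword
      default_search :
        tokenizer : standard
        filter : [standard, lowercase]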
Aliasing Analyzers
Analyzers can be aliased to have several registered lookup names
associated with them. For example, the following will allow
the standard analyzer to also be referenced with alias1
and alias2 values.
index :
analysis :
analyzer :
standard :
alias: [alias1, alias2]
type : standard
stopwords : [test1, test2, test3]
Below is a list of the built in analyzers.
123.1. Standard Analyzer
An analyzer of type standard is built using the
Standard
Tokenizer with the
Standard
Token Filter,
Lower
Case Token Filter, and
Stop
Token Filter.
The following are settings that can be set for a standard analyzer
type:
| Setting | Description |
|---|---|
| stopwords | A list of stopwords to initialize the stop filter with. Defaults to an empty stopword list. Check Stop Analyzer for more details. |
| max_token_length | The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255. |
123.2. Simple Analyzer
An analyzer of type simple that is built using a
Lower
Case Tokenizer.
123.3. Whitespace Analyzer
An analyzer of type whitespace that is built using a
Whitespace
Tokenizer.
123.4. Stop Analyzer
An analyzer of type stop that is built using a
Lower
Case Tokenizer, with
Stop
Token Filter.
The following are settings that can be set for a stop analyzer type:
| Setting | Description |
|---|---|
| stopwords | A list of stopwords to initialize the stop filter with. Defaults to the english stop words. |
| stopwords_path | A path (either relative to the config location, or absolute) to a stopwords file configuration. |
Use stopwords: _none_ to explicitly specify an empty stopword list.
123.5. Keyword Analyzer
An analyzer of type keyword that "tokenizes" an entire stream as a
single token. This is useful for data like zip codes, ids and so on.
Note, when using mapping definitions, it might make more sense to simply
mark the field as not_analyzed.
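As an illustration, marking a field as not_analyzed in the mapping has the same effect as the keyword analyzer; the zip_code field name is hypothetical:
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "zip_code": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}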
123.6. Pattern Analyzer
An analyzer of type pattern that can flexibly separate text into terms
via a regular expression.
The following are settings that can be set for a pattern analyzer
type:
| Setting | Description |
|---|---|
| lowercase | Should terms be lowercased or not. Defaults to true. |
| pattern | The regular expression pattern, defaults to \W+. |
| flags | The regular expression flags. |
| stopwords | A list of stopwords to initialize the stop filter with. Defaults to an empty stopword list. Check Stop Analyzer for more details. |
IMPORTANT: The regular expression should match the token separators, not the tokens themselves.
Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS". Check
Java
Pattern API for more details about flags options.
Pattern Analyzer Examples
In order to try out these examples, you should delete the test index
before running each example.
Whitespace tokenizer
DELETE test
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"whitespace": {
"type": "pattern",
"pattern": "\\s+"
}
}
}
}
}
GET /test/_analyze?analyzer=whitespace&text=foo,bar baz
# "foo,bar", "baz"
Non-word character tokenizer
DELETE test
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"nonword": {
"type": "pattern",
"pattern": "[^\\w]+"
}
}
}
}
}
GET /test/_analyze?analyzer=nonword&text=foo,bar baz
# "foo,bar baz" becomes "foo", "bar", "baz"
GET /test/_analyze?analyzer=nonword&text=type_1-type_4
# "type_1","type_4"
CamelCase tokenizer
DELETE test
PUT /test?pretty=1
{
"settings": {
"analysis": {
"analyzer": {
"camel": {
"type": "pattern",
"pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
}
}
}
}
}
GET /test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
# "moose","x","ftp","class","2","beta"
The regex above is easier to understand as:
([^\p{L}\d]+) # swallow non letters and numbers,
| (?<=\D)(?=\d) # or non-number followed by number,
| (?<=\d)(?=\D) # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]]) # or lower case
(?=\p{Lu}) # followed by upper case,
| (?<=\p{Lu}) # or upper case
(?=\p{Lu} # followed by upper case
[\p{L}&&[^\p{Lu}]] # then lower case
)
123.7. Language Analyzers
A set of analyzers aimed at analyzing specific language text. The
following types are supported:
arabic,
armenian,
basque,
brazilian,
bulgarian,
catalan,
cjk,
czech,
danish,
dutch,
english,
finnish,
french,
galician,
german,
greek,
hindi,
hungarian,
indonesian,
irish,
italian,
latvian,
lithuanian,
norwegian,
persian,
portuguese,
romanian,
russian,
sorani,
spanish,
swedish,
turkish,
thai.
123.7.1. Configuring language analyzers
Stopwords
All analyzers support setting custom stopwords either internally in
the config, or by using an external stopwords file by setting
stopwords_path. Check Stop Analyzer for
more details.
Excluding words from stemming
The stem_exclusion parameter allows you to specify an array
of lowercase words that should not be stemmed. Internally, this
functionality is implemented by adding the
keyword_marker token filter
with the keywords set to the value of the stem_exclusion parameter.
The following analyzers support setting custom stem_exclusion list:
arabic, armenian, basque, bulgarian, catalan, czech, dutch,
english, finnish, french, galician, german, hindi, hungarian,
indonesian, irish, italian, latvian, lithuanian, norwegian,
portuguese, romanian, russian, sorani, spanish, swedish, turkish.
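As a sketch, a stem_exclusion list could be passed directly to one of these analyzers; the analyzer name my_english and the word list are illustrative:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "skies", "running" ]
        }
      }
    }
  }
}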
123.7.2. Reimplementing language analyzers
The built-in language analyzers can be reimplemented as custom analyzers
(as described below) in order to customize their behaviour.
If you do not intend to exclude words from being stemmed (the
equivalent of the stem_exclusion parameter above), then you should remove
the keyword_marker token filter from the custom analyzer configuration.
arabic analyzer
The arabic analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"arabic_stop": {
"type": "stop",
"stopwords": "_arabic_"
},
"arabic_keywords": {
"type": "keyword_marker",
"keywords": []
},
"arabic_stemmer": {
"type": "stemmer",
"language": "arabic"
}
},
"analyzer": {
"arabic": {
"tokenizer": "standard",
"filter": [
"lowercase",
"arabic_stop",
"arabic_normalization",
"arabic_keywords",
"arabic_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
armenian analyzer
The armenian analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"armenian_stop": {
"type": "stop",
"stopwords": "_armenian_"
},
"armenian_keywords": {
"type": "keyword_marker",
"keywords": []
},
"armenian_stemmer": {
"type": "stemmer",
"language": "armenian"
}
},
"analyzer": {
"armenian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"armenian_stop",
"armenian_keywords",
"armenian_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
basque analyzer
The basque analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"basque_stop": {
"type": "stop",
"stopwords": "_basque_"
},
"basque_keywords": {
"type": "keyword_marker",
"keywords": []
},
"basque_stemmer": {
"type": "stemmer",
"language": "basque"
}
},
"analyzer": {
"basque": {
"tokenizer": "standard",
"filter": [
"lowercase",
"basque_stop",
"basque_keywords",
"basque_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
brazilian analyzer
The brazilian analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"brazilian_stop": {
"type": "stop",
"stopwords": "_brazilian_"
},
"brazilian_keywords": {
"type": "keyword_marker",
"keywords": []
},
"brazilian_stemmer": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"brazilian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"brazilian_stop",
"brazilian_keywords",
"brazilian_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
bulgarian analyzer
The bulgarian analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"bulgarian_stop": {
"type": "stop",
"stopwords": "_bulgarian_"
},
"bulgarian_keywords": {
"type": "keyword_marker",
"keywords": []
},
"bulgarian_stemmer": {
"type": "stemmer",
"language": "bulgarian"
}
},
"analyzer": {
"bulgarian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"bulgarian_stop",
"bulgarian_keywords",
"bulgarian_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
catalan analyzer
The catalan analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"catalan_elision": {
"type": "elision",
"articles": [ "d", "l", "m", "n", "s", "t"]
},
"catalan_stop": {
"type": "stop",
"stopwords": "_catalan_"
},
"catalan_keywords": {
"type": "keyword_marker",
"keywords": []
},
"catalan_stemmer": {
"type": "stemmer",
"language": "catalan"
}
},
"analyzer": {
"catalan": {
"tokenizer": "standard",
"filter": [
"catalan_elision",
"lowercase",
"catalan_stop",
"catalan_keywords",
"catalan_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
cjk analyzer
The cjk analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
}
},
"analyzer": {
"cjk": {
"tokenizer": "standard",
"filter": [
"cjk_width",
"lowercase",
"cjk_bigram",
"english_stop"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters.
czech analyzer
The czech analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"czech_stop": {
"type": "stop",
"stopwords": "_czech_"
},
"czech_keywords": {
"type": "keyword_marker",
"keywords": []
},
"czech_stemmer": {
"type": "stemmer",
"language": "czech"
}
},
"analyzer": {
"czech": {
"tokenizer": "standard",
"filter": [
"lowercase",
"czech_stop",
"czech_keywords",
"czech_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
danish analyzer
The danish analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"danish_stop": {
"type": "stop",
"stopwords": "_danish_"
},
"danish_keywords": {
"type": "keyword_marker",
"keywords": []
},
"danish_stemmer": {
"type": "stemmer",
"language": "danish"
}
},
"analyzer": {
"danish": {
"tokenizer": "standard",
"filter": [
"lowercase",
"danish_stop",
"danish_keywords",
"danish_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
dutch analyzer
The dutch analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"dutch_stop": {
"type": "stop",
"stopwords": "_dutch_"
},
"dutch_keywords": {
"type": "keyword_marker",
"keywords": []
},
"dutch_stemmer": {
"type": "stemmer",
"language": "dutch"
},
"dutch_override": {
"type": "stemmer_override",
"rules": [
"fiets=>fiets",
"bromfiets=>bromfiets",
"ei=>eier",
"kind=>kinder"
]
}
},
"analyzer": {
"dutch": {
"tokenizer": "standard",
"filter": [
"lowercase",
"dutch_stop",
"dutch_keywords",
"dutch_override",
"dutch_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
english analyzer
The english analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": []
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
finnish analyzer
The finnish analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"finnish_stop": {
"type": "stop",
"stopwords": "_finnish_"
},
"finnish_keywords": {
"type": "keyword_marker",
"keywords": []
},
"finnish_stemmer": {
"type": "stemmer",
"language": "finnish"
}
},
"analyzer": {
"finnish": {
"tokenizer": "standard",
"filter": [
"lowercase",
"finnish_stop",
"finnish_keywords",
"finnish_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
french analyzer
The french analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"french_elision": {
"type": "elision",
"articles_case": true,
"articles": [
"l", "m", "t", "qu", "n", "s",
"j", "d", "c", "jusqu", "quoiqu",
"lorsqu", "puisqu"
]
},
"french_stop": {
"type": "stop",
"stopwords": "_french_"
},
"french_keywords": {
"type": "keyword_marker",
"keywords": []
},
"french_stemmer": {
"type": "stemmer",
"language": "light_french"
}
},
"analyzer": {
"french": {
"tokenizer": "standard",
"filter": [
"french_elision",
"lowercase",
"french_stop",
"french_keywords",
"french_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
galician analyzer
The galician analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"galician_stop": {
"type": "stop",
"stopwords": "_galician_"
},
"galician_keywords": {
"type": "keyword_marker",
"keywords": []
},
"galician_stemmer": {
"type": "stemmer",
"language": "galician"
}
},
"analyzer": {
"galician": {
"tokenizer": "standard",
"filter": [
"lowercase",
"galician_stop",
"galician_keywords",
"galician_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
german analyzer
The german analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"german_stop": {
"type": "stop",
"stopwords": "_german_"
},
"german_keywords": {
"type": "keyword_marker",
"keywords": []
},
"german_stemmer": {
"type": "stemmer",
"language": "light_german"
}
},
"analyzer": {
"german": {
"tokenizer": "standard",
"filter": [
"lowercase",
"german_stop",
"german_keywords",
"german_normalization",
"german_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
greek analyzer
The greek analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"greek_stop": {
"type": "stop",
"stopwords": "_greek_"
},
"greek_lowercase": {
"type": "lowercase",
"language": "greek"
},
"greek_keywords": {
"type": "keyword_marker",
"keywords": []
},
"greek_stemmer": {
"type": "stemmer",
"language": "greek"
}
},
"analyzer": {
"greek": {
"tokenizer": "standard",
"filter": [
"greek_lowercase",
"greek_stop",
"greek_keywords",
"greek_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
hindi analyzer
The hindi analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"hindi_stop": {
"type": "stop",
"stopwords": "_hindi_"
},
"hindi_keywords": {
"type": "keyword_marker",
"keywords": []
},
"hindi_stemmer": {
"type": "stemmer",
"language": "hindi"
}
},
"analyzer": {
"hindi": {
"tokenizer": "standard",
"filter": [
"lowercase",
"indic_normalization",
"hindi_normalization",
"hindi_stop",
"hindi_keywords",
"hindi_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
hungarian analyzer
The hungarian analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"hungarian_stop": {
"type": "stop",
"stopwords": "_hungarian_"
},
"hungarian_keywords": {
"type": "keyword_marker",
"keywords": []
},
"hungarian_stemmer": {
"type": "stemmer",
"language": "hungarian"
}
},
"analyzer": {
"hungarian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"hungarian_stop",
"hungarian_keywords",
"hungarian_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
indonesian analyzer
The indonesian analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"indonesian_stop": {
"type": "stop",
"stopwords": "_indonesian_"
},
"indonesian_keywords": {
"type": "keyword_marker",
"keywords": []
},
"indonesian_stemmer": {
"type": "stemmer",
"language": "indonesian"
}
},
"analyzer": {
"indonesian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"indonesian_stop",
"indonesian_keywords",
"indonesian_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
irish analyzer
The irish analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"irish_elision": {
"type": "elision",
"articles": [ "h", "n", "t" ]
},
"irish_stop": {
"type": "stop",
"stopwords": "_irish_"
},
"irish_lowercase": {
"type": "lowercase",
"language": "irish"
},
"irish_keywords": {
"type": "keyword_marker",
"keywords": []
},
"irish_stemmer": {
"type": "stemmer",
"language": "irish"
}
},
"analyzer": {
"irish": {
"tokenizer": "standard",
"filter": [
"irish_stop",
"irish_elision",
"irish_lowercase",
"irish_keywords",
"irish_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
italian analyzer
The italian analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"italian_elision": {
"type": "elision",
"articles": [
"c", "l", "all", "dall", "dell",
"nell", "sull", "coll", "pell",
"gl", "agl", "dagl", "degl", "negl",
"sugl", "un", "m", "t", "s", "v", "d"
]
},
"italian_stop": {
"type": "stop",
"stopwords": "_italian_"
},
"italian_keywords": {
"type": "keyword_marker",
"keywords": []
},
"italian_stemmer": {
"type": "stemmer",
"language": "light_italian"
}
},
"analyzer": {
"italian": {
"tokenizer": "standard",
"filter": [
"italian_elision",
"lowercase",
"italian_stop",
"italian_keywords",
"italian_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
latvian analyzer
The latvian analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"latvian_stop": {
"type": "stop",
"stopwords": "_latvian_"
},
"latvian_keywords": {
"type": "keyword_marker",
"keywords": []
},
"latvian_stemmer": {
"type": "stemmer",
"language": "latvian"
}
},
"analyzer": {
"latvian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"latvian_stop",
"latvian_keywords",
"latvian_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
lithuanian analyzer
The lithuanian analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"lithuanian_stop": {
"type": "stop",
"stopwords": "_lithuanian_"
},
"lithuanian_keywords": {
"type": "keyword_marker",
"keywords": []
},
"lithuanian_stemmer": {
"type": "stemmer",
"language": "lithuanian"
}
},
"analyzer": {
"lithuanian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"lithuanian_stop",
"lithuanian_keywords",
"lithuanian_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
norwegian analyzer
The norwegian analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"norwegian_stop": {
"type": "stop",
"stopwords": "_norwegian_"
},
"norwegian_keywords": {
"type": "keyword_marker",
"keywords": []
},
"norwegian_stemmer": {
"type": "stemmer",
"language": "norwegian"
}
},
"analyzer": {
"norwegian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"norwegian_stop",
"norwegian_keywords",
"norwegian_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
persian analyzer
The persian analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"char_filter": {
"zero_width_spaces": {
"type": "mapping",
"mappings": [ "\\u200C=> "]
}
},
"filter": {
"persian_stop": {
"type": "stop",
"stopwords": "_persian_"
}
},
"analyzer": {
"persian": {
"tokenizer": "standard",
"char_filter": [ "zero_width_spaces" ],
"filter": [
"lowercase",
"arabic_normalization",
"persian_normalization",
"persian_stop"
]
}
}
}
}
}
The zero_width_spaces character filter replaces zero-width non-joiners
with an ASCII space. The default stopwords can be overridden with the
stopwords or stopwords_path parameters.
portuguese analyzer
The portuguese analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"portuguese_stop": {
"type": "stop",
"stopwords": "_portuguese_"
},
"portuguese_keywords": {
"type": "keyword_marker",
"keywords": []
},
"portuguese_stemmer": {
"type": "stemmer",
"language": "light_portuguese"
}
},
"analyzer": {
"portuguese": {
"tokenizer": "standard",
"filter": [
"lowercase",
"portuguese_stop",
"portuguese_keywords",
"portuguese_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or
stopwords_path parameters. The keyword_marker filter should be removed
unless there are words which should be excluded from stemming.
romanian analyzer
The romanian analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"romanian_stop": {
"type": "stop",
"stopwords": "_romanian_"
},
"romanian_keywords": {
"type": "keyword_marker",
"keywords": []
},
"romanian_stemmer": {
"type": "stemmer",
"language": "romanian"
}
},
"analyzer": {
"romanian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"romanian_stop",
"romanian_keywords",
"romanian_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or stopwords_path parameters.
The romanian_keywords filter should be removed unless there are words which should be excluded from stemming.
russian analyzer
The russian analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"russian_stop": {
"type": "stop",
"stopwords": "_russian_"
},
"russian_keywords": {
"type": "keyword_marker",
"keywords": []
},
"russian_stemmer": {
"type": "stemmer",
"language": "russian"
}
},
"analyzer": {
"russian": {
"tokenizer": "standard",
"filter": [
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or stopwords_path parameters.
The russian_keywords filter should be removed unless there are words which should be excluded from stemming.
sorani analyzer
The sorani analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"sorani_stop": {
"type": "stop",
"stopwords": "_sorani_"
},
"sorani_keywords": {
"type": "keyword_marker",
"keywords": []
},
"sorani_stemmer": {
"type": "stemmer",
"language": "sorani"
}
},
"analyzer": {
"sorani": {
"tokenizer": "standard",
"filter": [
"sorani_normalization",
"lowercase",
"sorani_stop",
"sorani_keywords",
"sorani_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or stopwords_path parameters.
The sorani_keywords filter should be removed unless there are words which should be excluded from stemming.
spanish analyzer
The spanish analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"spanish_stop": {
"type": "stop",
"stopwords": "_spanish_"
},
"spanish_keywords": {
"type": "keyword_marker",
"keywords": []
},
"spanish_stemmer": {
"type": "stemmer",
"language": "light_spanish"
}
},
"analyzer": {
"spanish": {
"tokenizer": "standard",
"filter": [
"lowercase",
"spanish_stop",
"spanish_keywords",
"spanish_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or stopwords_path parameters.
The spanish_keywords filter should be removed unless there are words which should be excluded from stemming.
swedish analyzer
The swedish analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"swedish_stop": {
"type": "stop",
"stopwords": "_swedish_"
},
"swedish_keywords": {
"type": "keyword_marker",
"keywords": []
},
"swedish_stemmer": {
"type": "stemmer",
"language": "swedish"
}
},
"analyzer": {
"swedish": {
"tokenizer": "standard",
"filter": [
"lowercase",
"swedish_stop",
"swedish_keywords",
"swedish_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or stopwords_path parameters.
The swedish_keywords filter should be removed unless there are words which should be excluded from stemming.
turkish analyzer
The turkish analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"turkish_stop": {
"type": "stop",
"stopwords": "_turkish_"
},
"turkish_lowercase": {
"type": "lowercase",
"language": "turkish"
},
"turkish_keywords": {
"type": "keyword_marker",
"keywords": []
},
"turkish_stemmer": {
"type": "stemmer",
"language": "turkish"
}
},
"analyzer": {
"turkish": {
"tokenizer": "standard",
"filter": [
"apostrophe",
"turkish_lowercase",
"turkish_stop",
"turkish_keywords",
"turkish_stemmer"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or stopwords_path parameters.
The turkish_keywords filter should be removed unless there are words which should be excluded from stemming.
thai analyzer
The thai analyzer could be reimplemented as a custom analyzer as follows:
{
"settings": {
"analysis": {
"filter": {
"thai_stop": {
"type": "stop",
"stopwords": "_thai_"
}
},
"analyzer": {
"thai": {
"tokenizer": "thai",
"filter": [
"lowercase",
"thai_stop"
]
}
}
}
}
}
The default stopwords can be overridden with the stopwords or stopwords_path parameters.
123.8. Snowball Analyzer
An analyzer of type snowball that uses the
standard
tokenizer, with
standard
filter,
lowercase
filter,
stop
filter, and
snowball
filter.
The Snowball Analyzer is a stemming analyzer from Lucene that is originally based on the snowball project from snowballstem.org.
Sample usage:
{
"index" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"type" : "snowball",
"language" : "English"
}
}
}
}
}
The language parameter can have the same values as the
snowball
filter and defaults to English. Note that not all the language
analyzers have a default set of stopwords provided.
The stopwords parameter can be used to provide stopwords for the
languages that have no defaults, or to simply replace the default set
with your custom list. Check Stop Analyzer
for more details. A default set of stopwords for many of these
languages is available from, for instance, here and here.
A sample configuration (in YAML format) specifying Swedish with stopwords:
index :
    analysis :
        analyzer :
            my_analyzer:
                type: snowball
                language: Swedish
                stopwords: "och,det,att,i,en,jag,hon,som,han,på,den,med,var,sig,för,så,till,är,men,ett,om,hade,de,av,icke,mig,du,henne,då,sin,nu,har,inte,hans,honom,skulle,hennes,där,min,man,ej,vid,kunde,något,från,ut,när,efter,upp,vi,dem,vara,vad,över,än,dig,kan,sina,här,ha,mot,alla,under,någon,allt,mycket,sedan,ju,denna,själv,detta,åt,utan,varit,hur,ingen,mitt,ni,bli,blev,oss,din,dessa,några,deras,blir,mina,samma,vilken,er,sådan,vår,blivit,dess,inom,mellan,sådant,varför,varje,vilka,ditt,vem,vilket,sitta,sådana,vart,dina,vars,vårt,våra,ert,era,vilkas"
123.9. Custom Analyzer
An analyzer of type custom that allows you to combine a Tokenizer with
zero or more Token Filters, and zero or more Char Filters. The
custom analyzer accepts a logical/registered name of the tokenizer to
use, and a list of logical/registered names of token filters.
The name of the custom analyzer must not start with "_".
The following are settings that can be set for a custom analyzer type:
| Setting | Description |
|---|---|
| tokenizer | The logical / registered name of the tokenizer to use. |
| filter | An optional list of logical / registered names of token filters. |
| char_filter | An optional list of logical / registered names of char filters. |
| position_increment_gap | An optional number of positions to increment between each field value of a field using this analyzer. Defaults to 100. 100 was chosen because it prevents phrase queries with reasonably large slops (less than 100) from matching terms across field values. |
Here is an example:
index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : myTokenizer1
                filter : [myTokenFilter1, myTokenFilter2]
                char_filter : [my_html]
                position_increment_gap: 256
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myTokenFilter2 :
                type : length
                min : 0
                max : 2000
        char_filter :
            my_html :
                type : html_strip
                escaped_tags : [xxx, yyy]
                read_ahead : 1024
124. Tokenizers
Tokenizers are used to break a string down into a stream of terms or tokens. A simple tokenizer might split the string up into terms wherever it encounters whitespace or punctuation.
Elasticsearch has a number of built in tokenizers which can be used to build custom analyzers.
124.1. Standard Tokenizer
A tokenizer of type standard that provides a grammar-based tokenizer
suitable for most European language documents. The tokenizer
implements the Unicode Text Segmentation algorithm, as specified in
Unicode Standard Annex #29.
The following are settings that can be set for a standard tokenizer
type:
| Setting | Description |
|---|---|
| max_token_length | The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255. |
124.2. Edge NGram Tokenizer
A tokenizer of type edgeNGram.
This tokenizer is very similar to nGram but only keeps n-grams which
start at the beginning of a token.
The following are settings that can be set for an edgeNGram tokenizer
type:
| Setting | Description | Default value |
|---|---|---|
| min_gram | Minimum size in codepoints of a single n-gram | 1 |
| max_gram | Maximum size in codepoints of a single n-gram | 2 |
| token_chars | Characters classes to keep in the tokens, Elasticsearch will split on characters that don’t belong to any of these classes. | [] (keep all characters) |
token_chars accepts the following character classes:
| letter | for example a, b, ï or 京 |
| digit | for example 3 or 7 |
| whitespace | for example " " or "\n" |
| punctuation | for example ! or " |
| symbol | for example $ or √ |
Example
curl -XPUT 'localhost:9200/test' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"my_edge_ngram_analyzer" : {
"tokenizer" : "my_edge_ngram_tokenizer"
}
},
"tokenizer" : {
"my_edge_ngram_tokenizer" : {
"type" : "edgeNGram",
"min_gram" : "2",
"max_gram" : "5",
"token_chars": [ "letter", "digit" ]
}
}
}
}
}'
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_edge_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, Scha, Schal, 04
side deprecated
There used to be a side parameter up to 0.90.1 but it is now deprecated. In
order to emulate the behavior of "side" : "BACK" a
reverse token filter should be used together
with the edgeNGram token filter. The
edgeNGram filter must be enclosed in reverse filters like this:
"filter" : ["reverse", "edgeNGram", "reverse"]
which essentially reverses the token, builds front EdgeNGrams and reverses
the ngram again. This has the same effect as the previous "side" : "BACK" setting.
124.3. Keyword Tokenizer
A tokenizer of type keyword that emits the entire input as a single
output.
The following are settings that can be set for a keyword tokenizer
type:
| Setting | Description |
|---|---|
| buffer_size | The term buffer size. Defaults to 256. |
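As a quick illustration (this sketch assumes a locally running node), the _analyze API shows that the entire input becomes one token:
curl 'localhost:9200/_analyze?pretty=1&tokenizer=keyword' -d 'New York'
# emits the entire input as a single token: New York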
124.4. Letter Tokenizer
A tokenizer of type letter that divides text at non-letters. That’s to
say, it defines tokens as maximal strings of adjacent letters. Note,
this does a decent job for most European languages, but does a terrible
job for some Asian languages, where words are not separated by spaces.
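As an illustrative sketch (assuming a locally running node), the _analyze API shows the splitting behavior:
curl 'localhost:9200/_analyze?pretty=1&tokenizer=letter' -d "You're no.1!"
# emits: You, re, no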
124.5. Lowercase Tokenizer
A tokenizer of type lowercase that performs the function of
Letter
Tokenizer and
Lower
Case Token Filter together. It divides text at non-letters and converts
them to lower case. While it is functionally equivalent to the
combination of
Letter
Tokenizer and
Lower
Case Token Filter, there is a performance advantage to doing the two
tasks at once, hence this (redundant) implementation.
124.6. NGram Tokenizer
A tokenizer of type nGram.
The following are settings that can be set for an nGram tokenizer type:
| Setting | Description | Default value |
|---|---|---|
| min_gram | Minimum size in codepoints of a single n-gram | 1 |
| max_gram | Maximum size in codepoints of a single n-gram | 2 |
| token_chars | Characters classes to keep in the tokens, Elasticsearch will split on characters that don’t belong to any of these classes. | [] (keep all characters) |
token_chars accepts the following character classes:
| letter | for example a, b, ï or 京 |
| digit | for example 3 or 7 |
| whitespace | for example " " or "\n" |
| punctuation | for example ! or " |
| symbol | for example $ or √ |
Example
curl -XPUT 'localhost:9200/test' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"my_ngram_analyzer" : {
"tokenizer" : "my_ngram_tokenizer"
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "2",
"max_gram" : "3",
"token_chars": [ "letter", "digit" ]
}
}
}
}
}'
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
124.8. Pattern Tokenizer
A tokenizer of type pattern that can flexibly separate text into terms
via a regular expression. Accepts the following settings:
| Setting | Description |
|---|---|
| pattern | The regular expression pattern, defaults to \W+. |
| flags | The regular expression flags. |
| group | Which group to extract into tokens. Defaults to -1 (split). |
IMPORTANT: The regular expression should match the token separators, not the tokens themselves.
group set to -1 (the default) is equivalent to "split". Using group
>= 0 selects the matching group as the token. For example, if you have:
pattern = '([^']+)'
group   = 0
input   = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks).
With the same input but using group=1, the output would be: bbb and ccc (no ' marks).
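For example, a pattern tokenizer that splits comma-separated values could be configured as follows (the index and tokenizer names are hypothetical):
curl -XPUT 'localhost:9200/test' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"my_csv_analyzer" : {
"tokenizer" : "my_csv_tokenizer"
}
},
"tokenizer" : {
"my_csv_tokenizer" : {
"type" : "pattern",
"pattern" : ","
}
}
}
}
}'
Analyzing "comma,separated,values" with this analyzer produces the tokens comma, separated and values.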
124.9. UAX Email URL Tokenizer
A tokenizer of type uax_url_email which works exactly like the
standard tokenizer, but tokenizes emails and urls as single tokens.
The following are settings that can be set for a uax_url_email
tokenizer type:
| Setting | Description |
|---|---|
| max_token_length | The maximum token length. If a token is seen that exceeds this length then it is discarded. Defaults to 255. |
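For example (assuming a locally running node), compare the two tokenizers on text containing an email address:
curl 'localhost:9200/_analyze?pretty=1&tokenizer=uax_url_email' -d 'Email me at john.smith@global-international.com'
# emits: Email, me, at, john.smith@global-international.com
The standard tokenizer would instead break the address up into several tokens.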
124.10. Path Hierarchy Tokenizer
The path_hierarchy tokenizer takes something like this:
/something/something/else
And produces tokens:
/something /something/something /something/something/else
| Setting | Description |
|---|---|
| delimiter | The character delimiter to use, defaults to /. |
| replacement | An optional replacement character to use. Defaults to the delimiter. |
| buffer_size | The buffer size to use, defaults to 1024. |
| reverse | Generates tokens in reverse order, defaults to false. |
| skip | Controls initial tokens to skip, defaults to 0. |
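A minimal configuration sketch (the index and tokenizer names are hypothetical):
curl -XPUT 'localhost:9200/test' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"my_path_analyzer" : {
"tokenizer" : "my_path_tokenizer"
}
},
"tokenizer" : {
"my_path_tokenizer" : {
"type" : "path_hierarchy",
"delimiter" : "/"
}
}
}
}
}'
Analyzing /usr/local/bin with this analyzer produces /usr, /usr/local and /usr/local/bin.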
124.11. Classic Tokenizer
A tokenizer of type classic that provides a grammar-based tokenizer
which is good for English language documents. This tokenizer has
heuristics for special treatment of acronyms, company names, email addresses,
and internet host names. However, these rules don’t always work, and
the tokenizer doesn’t work well for most languages other than English.
The following are settings that can be set for a classic tokenizer
type:
| Setting | Description |
|---|---|
| max_token_length | The maximum token length. If a token is seen that exceeds this length then it is discarded. Defaults to 255. |
125. Token Filters
Token filters accept a stream of tokens from a tokenizer and can modify tokens (eg lowercasing), delete tokens (eg remove stopwords) or add tokens (eg synonyms).
Elasticsearch has a number of built in token filters which can be used to build custom analyzers.
125.1. Standard Token Filter
A token filter of type standard that normalizes tokens extracted with
the
Standard
Tokenizer.
125.2. ASCII Folding Token Filter
A token filter of type asciifolding that converts alphabetic, numeric,
and symbolic Unicode characters which are not in the first 127 ASCII
characters (the "Basic Latin" Unicode block) into their ASCII
equivalents, if one exists. Example:
"index" : {
"analysis" : {
"analyzer" : {
"default" : {
"tokenizer" : "standard",
"filter" : ["standard", "asciifolding"]
}
}
}
}
Accepts preserve_original setting which defaults to false but if true
will keep the original token as well as emit the folded token. For
example:
"index" : {
"analysis" : {
"analyzer" : {
"default" : {
"tokenizer" : "standard",
"filter" : ["standard", "my_ascii_folding"]
}
},
"filter" : {
"my_ascii_folding" : {
"type" : "asciifolding",
"preserve_original" : true
}
}
}
}
125.3. Length Token Filter
A token filter of type length that removes words that are too long or
too short for the stream.
The following are settings that can be set for a length token filter
type:
| Setting | Description |
|---|---|
| min | The minimum number. Defaults to 0. |
| max | The maximum number. Defaults to Integer.MAX_VALUE. |
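For example, the following sketch (filter and analyzer names are hypothetical) drops tokens shorter than 2 or longer than 10 characters:
{
"settings" : {
"analysis" : {
"filter" : {
"my_length" : {
"type" : "length",
"min" : 2,
"max" : 10
}
},
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : [ "lowercase", "my_length" ]
}
}
}
}
}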
125.4. Lowercase Token Filter
A token filter of type lowercase that normalizes token text to lower
case.
The lowercase token filter supports Greek, Irish, and Turkish lowercasing
through the language parameter. Below is a usage example in a
custom analyzer:
index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : myTokenizer1
                filter : [myTokenFilter1, myGreekLowerCaseFilter]
                char_filter : [my_html]
        tokenizer :
            myTokenizer1 :
                type : standard
                max_token_length : 900
        filter :
            myTokenFilter1 :
                type : stop
                stopwords : [stop1, stop2, stop3, stop4]
            myGreekLowerCaseFilter :
                type : lowercase
                language : greek
        char_filter :
            my_html :
                type : html_strip
                escaped_tags : [xxx, yyy]
                read_ahead : 1024
125.5. Uppercase Token Filter
A token filter of type uppercase that normalizes token text to upper
case.
125.6. NGram Token Filter
A token filter of type nGram.
The following are settings that can be set for an nGram token filter
type:
| Setting | Description |
|---|---|
| min_gram | Defaults to 1. |
| max_gram | Defaults to 2. |
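A usage sketch in a custom analyzer (the names are hypothetical):
{
"settings" : {
"analysis" : {
"filter" : {
"my_ngram" : {
"type" : "nGram",
"min_gram" : 3,
"max_gram" : 4
}
},
"analyzer" : {
"my_ngram_analyzer" : {
"tokenizer" : "standard",
"filter" : [ "lowercase", "my_ngram" ]
}
}
}
}
}
With these settings the token quick is expanded into its 3- and 4-character grams: qui, uic, ick, quic and uick.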
125.7. Edge NGram Token Filter
A token filter of type edgeNGram.
The following are settings that can be set for an edgeNGram token
filter type:
| Setting | Description |
|---|---|
| min_gram | Defaults to 1. |
| max_gram | Defaults to 2. |
| side | Deprecated. Either front or back. Defaults to front. |
125.8. Porter Stem Token Filter
A token filter of type porter_stem that transforms the token stream as
per the Porter stemming algorithm.
Note, the input to the stemming filter must already be in lower case, so
you will need to use the
Lower
Case Token Filter or
Lower
Case Tokenizer earlier in the analysis chain in order for this to
work properly. For example, when using a custom analyzer, make sure the
lowercase filter comes before the porter_stem filter in the list of
filters.
125.9. Shingle Token Filter
A token filter of type shingle that constructs shingles (token
n-grams) from a token stream. In other words, it creates combinations of
tokens as a single token. For example, the sentence "please divide this
sentence into shingles" might be tokenized into shingles "please
divide", "divide this", "this sentence", "sentence into", and "into
shingles".
This filter handles position increments > 1 by inserting filler tokens (tokens with termtext "_"). It does not handle a position increment of 0.
The following are settings that can be set for a shingle token filter
type:
| Setting | Description |
|---|---|
| max_shingle_size | The maximum shingle size. Defaults to 2. |
| min_shingle_size | The minimum shingle size. Defaults to 2. |
| output_unigrams | If true the output will contain the input tokens (unigrams) as well as the shingles. Defaults to true. |
| output_unigrams_if_no_shingles | If output_unigrams is false the output will contain the input tokens (unigrams) if no shingles are available. Note: if output_unigrams is set to true this setting has no effect. Defaults to false. |
| token_separator | The string to use when joining adjacent tokens to form a shingle. Defaults to " ". |
| filler_token | The string to use as a replacement for each position at which there is no actual token in the stream. For instance this string is used if the position increment is greater than one when a stop filter is used together with the shingle filter. Defaults to "_". |
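A configuration sketch (filter and analyzer names are hypothetical):
{
"settings" : {
"analysis" : {
"filter" : {
"my_shingle" : {
"type" : "shingle",
"min_shingle_size" : 2,
"max_shingle_size" : 2,
"output_unigrams" : true
}
},
"analyzer" : {
"my_shingle_analyzer" : {
"tokenizer" : "standard",
"filter" : [ "lowercase", "my_shingle" ]
}
}
}
}
}
Analyzing "please divide this" with these settings produces please, please divide, divide, divide this, this.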
125.10. Stop Token Filter
A token filter of type stop that removes stop words from token
streams.
The following are settings that can be set for a stop token filter
type:
| Setting | Description |
|---|---|
| stopwords | A list of stop words to use. Defaults to the _english_ stop words. |
| stopwords_path | A path (either relative to config location, or absolute) to a stopwords file configuration. |
| ignore_case | Set to true to lower case all words first. Defaults to false. |
| remove_trailing | Set to false in order to not ignore the last term of a search if it is a stop word. This is very useful for the completion suggester as a query like green a can be extended to green apple even though you remove stop words in general. Defaults to true. |
The stopwords parameter accepts either an array of stopwords:
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"my_stop": {
"type": "stop",
"stopwords": ["and", "is", "the"]
}
}
}
}
}
or a predefined language-specific list:
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"my_stop": {
"type": "stop",
"stopwords": "_english_"
}
}
}
}
}
Elasticsearch provides the following predefined list of languages:
_arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_,
_catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_,
_french_, _galician_, _german_, _greek_, _hindi_, _hungarian_,
_indonesian_, _irish_, _italian_, _latvian_, _norwegian_, _persian_,
_portuguese_, _romanian_, _russian_, _sorani_, _spanish_,
_swedish_, _thai_, _turkish_.
For the empty stopwords list (to disable stopwords) use: _none_.
125.11. Word Delimiter Token Filter
Named word_delimiter, it Splits words into subwords and performs
optional transformations on subword groups. Words are split into
subwords with the following rules:
-
split on intra-word delimiters (by default, all non alpha-numeric characters).
-
"Wi-Fi" → "Wi", "Fi"
-
split on case transitions: "PowerShot" → "Power", "Shot"
-
split on letter-number transitions: "SD500" → "SD", "500"
-
leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, dude" → "hello", "there", "dude"
-
trailing "'s" are removed for each subword: "O’Neil’s" → "O", "Neil"
Parameters include:
generate_word_parts
If true causes parts of words to be generated: "PowerShot" ⇒ "Power" "Shot". Defaults to true.
generate_number_parts
If true causes number subwords to be generated: "500-42" ⇒ "500" "42". Defaults to true.
catenate_words
If true causes maximum runs of word parts to be catenated: "wi-fi" ⇒ "wifi". Defaults to false.
catenate_numbers
If true causes maximum runs of number parts to be catenated: "500-42" ⇒ "50042". Defaults to false.
catenate_all
If true causes all subword parts to be catenated: "wi-fi-4000" ⇒ "wifi4000". Defaults to false.
split_on_case_change
If true causes "PowerShot" to be two tokens ("Power-Shot" remains two parts regardless). Defaults to true.
preserve_original
If true includes original words in subwords: "500-42" ⇒ "500-42" "500" "42". Defaults to false.
split_on_numerics
If true causes "j2se" to be three tokens: "j" "2" "se". Defaults to true.
stem_english_possessive
If true causes trailing "'s" to be removed for each subword: "O’Neil’s" ⇒ "O", "Neil". Defaults to true.
Advanced settings include:
protected_words
A list of words protected from being delimited. Either an array, or set protected_words_path, which resolves to a file configured with protected words (one per line). Automatically resolves to a config/ based location if it exists.
type_table
A custom type mapping table, for example (when configured using
type_table_path):
# Map the $, %, '.', and ',' characters to DIGIT
# This might be useful for financial data.
$ => DIGIT
% => DIGIT
. => DIGIT
\\u002C => DIGIT
# in some cases you might not want to split on ZWJ
# this also tests the case where we need a bigger byte[]
# see http://en.wikipedia.org/wiki/Zero-width_joiner
\\u200D => ALPHANUM
Note: Using a tokenizer like the standard tokenizer may interfere with
the catenate_* and preserve_original parameters, as the original
string may already have lost punctuation during tokenization. Instead,
you may want to use the whitespace tokenizer.
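A configuration sketch using the whitespace tokenizer (filter and analyzer names are hypothetical); lowercase is applied after the word delimiter so that case transitions are still visible to it:
{
"settings" : {
"analysis" : {
"filter" : {
"my_word_delimiter" : {
"type" : "word_delimiter",
"catenate_words" : true,
"preserve_original" : true
}
},
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "whitespace",
"filter" : [ "my_word_delimiter", "lowercase" ]
}
}
}
}
}
With these settings "Wi-Fi" produces the original token plus the subwords and the catenated form, lowercased: wi-fi, wi, fi, wifi.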
125.12. Stemmer Token Filter
A filter that provides access to (almost) all of the available stemming token filters through a single unified interface. For example:
{
"index" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "my_stemmer"]
}
},
"filter" : {
"my_stemmer" : {
"type" : "stemmer",
"name" : "light_german"
}
}
}
}
}
The language/name parameter controls the stemmer with the following
available values:
| Language | Filter names |
|---|---|
| Arabic | arabic |
| Armenian | armenian |
| Basque | basque |
| Brazilian Portuguese | brazilian |
| Bulgarian | bulgarian |
| Catalan | catalan |
| Czech | czech |
| Danish | danish |
| Dutch | dutch, dutch_kp |
| English | english, light_english, minimal_english, possessive_english, porter2, lovins |
| Finnish | finnish, light_finnish |
| French | french, light_french, minimal_french |
| Galician | galician, minimal_galician |
| German | german, german2, light_german, minimal_german |
| Greek | greek |
| Hindi | hindi |
| Hungarian | hungarian, light_hungarian |
| Indonesian | indonesian |
| Irish | irish |
| Italian | italian, light_italian |
| Kurdish (Sorani) | sorani |
| Latvian | latvian |
| Lithuanian | lithuanian |
| Norwegian (Bokmål) | norwegian, light_norwegian, minimal_norwegian |
| Norwegian (Nynorsk) | light_nynorsk, minimal_nynorsk |
| Portuguese | portuguese, light_portuguese, minimal_portuguese, portuguese_rslp |
| Romanian | romanian |
| Russian | russian, light_russian |
| Spanish | spanish, light_spanish |
| Swedish | swedish, light_swedish |
| Turkish | turkish |
125.13. Stemmer Override Token Filter
Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being modified by stemmers. Must be placed before any stemming filters.
Rules are mappings of the form source => stem.
| Setting | Description |
|---|---|
| rules | A list of mapping rules to use. |
| rules_path | A path (either relative to config location, or absolute) to a list of mappings. |
Here is an example:
index :
    analysis :
        analyzer :
            myAnalyzer :
                type : custom
                tokenizer : standard
                filter : [lowercase, custom_stems, porter_stem]
        filter:
            custom_stems:
                type: stemmer_override
                rules_path : analysis/custom_stems.txt
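The mappings can also be supplied inline through the rules parameter instead of rules_path; the rules below are hypothetical examples:
index :
    analysis :
        analyzer :
            myAnalyzer :
                type : custom
                tokenizer : standard
                filter : [lowercase, custom_stems, porter_stem]
        filter:
            custom_stems:
                type: stemmer_override
                rules: ["running => run", "ran => run"]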
125.14. Keyword Marker Token Filter
Protects words from being modified by stemmers. Must be placed before any stemming filters.
| Setting | Description |
|---|---|
| keywords | A list of words to use. |
| keywords_path | A path (either relative to config location, or absolute) to a list of words. |
| ignore_case | Set to true to lower case all words first. Defaults to false. |
Here is an example:
index :
    analysis :
        analyzer :
            myAnalyzer :
                type : custom
                tokenizer : standard
                filter : [lowercase, protwords, porter_stem]
        filter :
            protwords :
                type : keyword_marker
                keywords_path : analysis/protwords.txt
125.15. Keyword Repeat Token Filter
The keyword_repeat token filter emits each incoming token twice, once
as a keyword and once as a non-keyword, to allow an unstemmed version of a
term to be indexed side by side with the stemmed version of the term.
Given the nature of this filter, each token that isn’t transformed by a
subsequent stemmer will be indexed twice. Therefore, consider adding a
unique filter with only_on_same_position set to true to drop
unnecessary duplicates.
Here is an example:
index :
    analysis :
        analyzer :
            myAnalyzer :
                type : custom
                tokenizer : standard
                filter : [lowercase, keyword_repeat, porter_stem, unique_stem]
        filter :
            unique_stem:
                type: unique
                only_on_same_position : true
125.16. KStem Token Filter
The kstem token filter is a high performance filter for english. All
terms must already be lowercased (use lowercase filter) for this
filter to work correctly.
125.17. Snowball Token Filter
A filter that stems words using a Snowball-generated stemmer. The
language parameter controls the stemmer with the following available
values: Armenian, Basque, Catalan, Danish, Dutch, English,
Finnish, French, German, German2, Hungarian, Italian, Kp,
Lithuanian, Lovins, Norwegian, Porter, Portuguese, Romanian,
Russian, Spanish, Swedish, Turkish.
For example:
{
"index" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "my_snow"]
}
},
"filter" : {
"my_snow" : {
"type" : "snowball",
"language" : "Lovins"
}
}
}
}
}
125.18. Phonetic Token Filter
The phonetic token filter is provided as a plugin and located
here.
125.19. Synonym Token Filter
The synonym token filter makes it easy to handle synonyms during the
analysis process. Synonyms are configured using a configuration file.
Here is an example:
{
"index" : {
"analysis" : {
"analyzer" : {
"synonym" : {
"tokenizer" : "whitespace",
"filter" : ["synonym"]
}
},
"filter" : {
"synonym" : {
"type" : "synonym",
"synonyms_path" : "analysis/synonym.txt"
}
}
}
}
}
The above configures a synonym filter, with a path of
analysis/synonym.txt (relative to the config location). The
synonym analyzer is then configured with the filter. Additional
settings are: ignore_case (defaults to false), and expand
(defaults to true).
The tokenizer parameter controls the tokenizer that will be used to
tokenize the synonym, and defaults to the whitespace tokenizer.
Two synonym formats are supported: Solr, WordNet.
Solr synonyms
The following is a sample format of the file:
# blank lines and lines starting with pound are comments.
#Explicit mappings match any token sequence on the LHS of "=>"
#and replace with all alternatives on the RHS. These types of mappings
#ignore the expand parameter in the schema.
#Examples:
i-pod, i pod => ipod,
sea biscuit, sea biscit => seabiscuit
#Equivalent synonyms may be separated with commas and give
#no explicit mapping. In this case the mapping behavior will
#be taken from the expand parameter in the schema. This allows
#the same synonym file to be used in different synonym handling strategies.
#Examples:
ipod, i-pod, i pod
foozball , foosball
universe , cosmos
# If expand==true, "ipod, i-pod, i pod" is equivalent
# to the explicit mapping:
ipod, i-pod, i pod => ipod, i-pod, i pod
# If expand==false, "ipod, i-pod, i pod" is equivalent
# to the explicit mapping:
ipod, i-pod, i pod => ipod
#multiple synonym mapping entries are merged.
foo => foo bar
foo => baz
#is equivalent to
foo => foo bar, baz
You can also define synonyms for the filter directly in the
configuration file (note use of synonyms instead of synonyms_path):
{
"filter" : {
"synonym" : {
"type" : "synonym",
"synonyms" : [
"i-pod, i pod => ipod",
"universe, cosmos"
]
}
}
}
However, it is recommended to define large synonym sets in a file using
synonyms_path.
WordNet synonyms
Synonyms based on WordNet format can be
declared using the following format:
{
"filter" : {
"synonym" : {
"type" : "synonym",
"format" : "wordnet",
"synonyms" : [
"s(100000001,1,'abstain',v,1,0).",
"s(100000001,2,'refrain',v,1,0).",
"s(100000001,3,'desist',v,1,0)."
]
}
}
}
Using synonyms_path to define WordNet synonyms in a file is supported
as well.
125.20. Compound Word Token Filter
The hyphenation_decompounder and dictionary_decompounder token filters can
decompose compound words found in many German languages into word parts.
Both token filters require a dictionary of word parts, which can be provided as:
word_list
An array of words, specified inline in the token filter configuration, or
word_list_path
The path (either absolute or relative to the config directory) to a file containing the word parts, one word per line.
Hyphenation decompounder
The hyphenation_decompounder uses hyphenation grammars to find potential
subwords that are then checked against the word dictionary. The quality of the
output tokens is directly connected to the quality of the grammar file you
use. For languages like German they are quite good.
XML based hyphenation grammar files can be found in the
Objects For Formatting Objects
(OFFO) Sourceforge project. Currently only FOP v1.2 compatible hyphenation files
are supported. You can download offo-hyphenation_v1.2.zip
directly and look in the offo-hyphenation/hyph/ directory.
Credits for the hyphenation code go to the Apache FOP project .
Dictionary decompounder
The dictionary_decompounder uses a brute force approach in conjunction with
only the word dictionary to find subwords in a compound word. It is much
slower than the hyphenation decompounder but can be used as a first start to
check the quality of your dictionary.
Compound token filter parameters
The following parameters can be used to configure a compound word token filter:
| Setting | Description |
|---|---|
| type | Either dictionary_decompounder or hyphenation_decompounder. |
| word_list | An array containing a list of words to use for the word dictionary. |
| word_list_path | The path (either absolute or relative to the config directory) to the word dictionary file. |
| hyphenation_patterns_path | The path (either absolute or relative to the config directory) to a FOP XML hyphenation patterns file. Required for the hyphenation_decompounder. |
| min_word_size | Minimum word size. Defaults to 5. |
| min_subword_size | Minimum subword size. Defaults to 2. |
| max_subword_size | Maximum subword size. Defaults to 15. |
| only_longest_match | Whether to include only the longest matching subword or not. Defaults to false. |
Here is an example:
index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : standard
                filter : [myTokenFilter1, myTokenFilter2]
        filter :
            myTokenFilter1 :
                type : dictionary_decompounder
                word_list: [one, two, three]
            myTokenFilter2 :
                type : hyphenation_decompounder
                word_list_path: path/to/words.txt
                hyphenation_patterns_path: path/to/fop.xml
                max_subword_size : 22
125.22. Elision Token Filter
A token filter which removes elisions. For example, "l’avion" (the plane) will be tokenized as "avion" (plane).
Accepts an articles setting which is a set of stop word articles. For
example:
"index" : {
"analysis" : {
"analyzer" : {
"default" : {
"tokenizer" : "standard",
"filter" : ["standard", "elision"]
}
},
"filter" : {
"elision" : {
"type" : "elision",
"articles" : ["l", "m", "t", "qu", "n", "s", "j"]
}
}
}
}
125.23. Truncate Token Filter
The truncate token filter can be used to truncate tokens to a
specific length. This can come in handy with keyword (single token)
based mapped fields that are used for sorting in order to reduce memory
usage.
It accepts a length parameter which controls the number of characters
to truncate to, and defaults to 10.
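For example, the following sketch (filter and analyzer names are hypothetical) truncates every token to at most 5 characters:
{
"settings" : {
"analysis" : {
"filter" : {
"my_truncate" : {
"type" : "truncate",
"length" : 5
}
},
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "keyword",
"filter" : [ "lowercase", "my_truncate" ]
}
}
}
}
}
With these settings the input Elasticsearch is indexed as the single token elast.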
125.24. Unique Token Filter
The unique token filter can be used to index only unique tokens during
analysis. By default it is applied to the whole token stream. If
only_on_same_position is set to true, it will only remove duplicate
tokens at the same position.
125.25. Pattern Capture Token Filter
The pattern_capture token filter, unlike the pattern tokenizer,
emits a token for every capture group in the regular expression.
Patterns are not anchored to the beginning and end of the string, so
each pattern can match multiple times, and matches are allowed to
overlap.
For instance, a pattern like:
"(([a-z]+)(\d*))"
when matched against:
"abc123def456"
would produce the tokens: [ abc123, abc, 123, def456, def,
456 ]
If preserve_original is set to true (the default) then it would also
emit the original token: abc123def456.
This is particularly useful for indexing text like camel-case code, eg
stripHTML where a user may search for "strip html" or "striphtml":
curl -XPUT localhost:9200/test/ -d '
{
"settings" : {
"analysis" : {
"filter" : {
"code" : {
"type" : "pattern_capture",
"preserve_original" : 1,
"patterns" : [
"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
"(\\d+)"
]
}
},
"analyzer" : {
"code" : {
"tokenizer" : "pattern",
"filter" : [ "code", "lowercase" ]
}
}
}
}
}
'
When used to analyze the text
import static org.apache.commons.lang.StringEscapeUtils.escapeHtml
this emits the tokens: [ import, static, org, apache, commons,
lang, stringescapeutils, string, escape, utils, escapehtml,
escape, html ]
Another example is analyzing email addresses:
curl -XPUT localhost:9200/test/ -d '
{
"settings" : {
"analysis" : {
"filter" : {
"email" : {
"type" : "pattern_capture",
"preserve_original" : 1,
"patterns" : [
"([^@]+)",
"(\\p{L}+)",
"(\\d+)",
"@(.+)"
]
}
},
"analyzer" : {
"email" : {
"tokenizer" : "uax_url_email",
"filter" : [ "email", "lowercase", "unique" ]
}
}
}
}
}
'
When the above analyzer is used on an email address like:
john-smith_123@foo-bar.com
it would produce the following tokens:
john-smith_123@foo-bar.com, john-smith_123, john, smith, 123, foo-bar.com, foo, bar, com
Multiple patterns are required to allow overlapping captures, but this also means that patterns are less dense and easier to understand.
Note: All tokens are emitted in the same position, and with the same
character offsets, so when combined with highlighting, the whole
original token will be highlighted, not just the matching subset. For
instance, querying the above email address for "smith" would
highlight:
<em>john-smith_123@foo-bar.com</em>
not:
john-<em>smith</em>_123@foo-bar.com
125.26. Pattern Replace Token Filter
The pattern_replace token filter makes it easy to handle string
replacements based on a regular expression. The regular expression is
defined using the pattern parameter, and the replacement string can be
provided using the replacement parameter (which supports referencing the
original text, as explained
here).
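As an illustrative sketch (the filter and analyzer names are invented for the example), a pattern_replace filter that rewrites runs of digits to the literal token text "NUM" could be configured like this:

```json
{
  "index" : {
    "analysis" : {
      "filter" : {
        "digits_to_num" : {
          "type" : "pattern_replace",
          "pattern" : "(\\d+)",
          "replacement" : "NUM"
        }
      },
      "analyzer" : {
        "replace_example" : {
          "tokenizer" : "standard",
          "filter" : [ "lowercase", "digits_to_num" ]
        }
      }
    }
  }
}
```

With this configuration, a token such as error404 would be indexed as errorNUM.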
125.28. Limit Token Count Token Filter
Limits the number of tokens that are indexed per document and field.
| Setting | Description |
|---|---|
| max_token_count | The maximum number of tokens that should be indexed per document and field. The default is 1. |
| consume_all_tokens | If set to true, the filter consumes the whole token stream even after max_token_count tokens have been emitted. The default is false. |
Here is an example:
index :
analysis :
analyzer :
myAnalyzer :
type : custom
tokenizer : standard
filter : [lowercase, five_token_limit]
filter :
five_token_limit :
type : limit
max_token_count : 5
125.29. Hunspell Token Filter
Basic support for hunspell stemming. Hunspell dictionaries will be
picked up from a dedicated hunspell directory on the filesystem
(<path.conf>/hunspell). Each dictionary is expected to
have its own directory named after its associated locale (language).
This dictionary directory is expected to hold a single *.aff and
one or more *.dic files (all of which will automatically be picked up).
For example, assuming the default hunspell location is used, the
following directory layout will define the en_US dictionary:
- conf
|-- hunspell
| |-- en_US
| | |-- en_US.dic
| | |-- en_US.aff
Each dictionary can be configured with one setting:
ignore_case-
If true, dictionary matching will be case insensitive (defaults to
false)
This setting can be configured globally in elasticsearch.yml using
indices.analysis.hunspell.dictionary.ignore_case, or for specific dictionaries:
indices.analysis.hunspell.dictionary.en_US.ignore_case.
It is also possible to add a settings.yml file under the dictionary
directory which holds these settings (this will override any other
settings defined in elasticsearch.yml).
One can use the hunspell stem filter by configuring it in the analysis settings:
{
"analysis" : {
"analyzer" : {
"en" : {
"tokenizer" : "standard",
"filter" : [ "lowercase", "en_US" ]
}
},
"filter" : {
"en_US" : {
"type" : "hunspell",
"locale" : "en_US",
"dedup" : true
}
}
}
}
The hunspell token filter accepts four options:
locale-
A locale for this filter. If this is unset, the lang or language options are used instead, so one of these has to be set.
dictionary-
The name of a dictionary. The path to your hunspell dictionaries should be configured via indices.analysis.hunspell.dictionary.location before.
dedup-
If only unique terms should be returned, this needs to be set to true. Defaults to true.
longest_only-
If only the longest term should be returned, set this to true. Defaults to false: all possible stems are returned.
Note: As opposed to the snowball stemmers (which are algorithm based), this is a dictionary lookup based stemmer and therefore the quality of the stemming is determined by the quality of the dictionary.
Dictionary loading
By default, the default Hunspell directory (config/hunspell/) is checked
for dictionaries when the node starts up, and any dictionaries are
automatically loaded.
Dictionary loading can be deferred until the dictionaries are actually used by setting
indices.analysis.hunspell.dictionary.lazy to true in the config file.
References
Hunspell is a spell checker and morphological analyzer designed for languages with rich morphology and complex word compounding and character encoding.
-
Wikipedia, http://en.wikipedia.org/wiki/Hunspell
-
Source code, http://hunspell.sourceforge.net/
-
Open Office Hunspell dictionaries, http://wiki.openoffice.org/wiki/Dictionaries
-
Mozilla Hunspell dictionaries, https://addons.mozilla.org/en-US/firefox/language-tools/
-
Chromium Hunspell dictionaries, http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/
125.30. Common Grams Token Filter
Token filter that generates bigrams for frequently occurring terms. Single terms are still indexed. It can be used as an alternative to the Stop Token Filter when we don’t want to completely ignore common terms.
For example, the text "the quick brown is a fox" will be tokenized as "the", "the_quick", "quick", "brown", "brown_is", "is_a", "a_fox", "fox", assuming "the", "is" and "a" are common words.
When query_mode is enabled, the token filter removes common words and
single terms followed by a common word. This parameter should be enabled
in the search analyzer.
For example, the query "the quick brown is a fox" will be tokenized as "the_quick", "quick", "brown_is", "is_a", "a_fox", "fox".
The following are settings that can be set:
| Setting | Description |
|---|---|
| common_words | A list of common words to use. |
| common_words_path | A path (either relative to the config location, or absolute) to a list of common words. |
| ignore_case | If true, common words matching will be case insensitive (defaults to false). |
| query_mode | Generates bigrams then removes common words and single terms followed by a common word (defaults to false). |
Note: either the common_words or the common_words_path setting is required.
Here is an example:
index :
analysis :
analyzer :
index_grams :
tokenizer : whitespace
filter : [common_grams]
search_grams :
tokenizer : whitespace
filter : [common_grams_query]
filter :
common_grams :
type : common_grams
common_words: [a, an, the]
common_grams_query :
type : common_grams
query_mode: true
common_words: [a, an, the]
125.31. Normalization Token Filter
There are several token filters available which try to normalize special characters of a certain language.
| Language | Token Filter |
|---|---|
| Arabic | arabic_normalization |
| German | german_normalization |
| Hindi | hindi_normalization |
| Indic | indic_normalization |
| Kurdish (Sorani) | sorani_normalization |
| Persian | persian_normalization |
| Scandinavian | scandinavian_normalization, scandinavian_folding |
| Serbian | serbian_normalization (not released yet) |
125.32. CJK Width Token Filter
The cjk_width token filter normalizes CJK width differences:
-
Folds fullwidth ASCII variants into the equivalent basic Latin
-
Folds halfwidth Katakana variants into the equivalent Kana
Note: This token filter can be viewed as a subset of NFKC/NFKD
Unicode normalization. See the analysis-icu plugin
for full normalization support.
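The filter takes no parameters; a minimal analyzer using it could look like this (the analyzer name is illustrative):

```json
{
  "index" : {
    "analysis" : {
      "analyzer" : {
        "cjk_width_example" : {
          "tokenizer" : "standard",
          "filter" : [ "cjk_width", "lowercase" ]
        }
      }
    }
  }
}
```

With this configuration, fullwidth text such as ＡＢＣ is indexed the same as abc.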
125.33. CJK Bigram Token Filter
The cjk_bigram token filter forms bigrams out of the CJK
terms that are generated by the standard tokenizer
or the icu_tokenizer (see analysis-icu plugin).
By default, when a CJK character has no adjacent characters to form a bigram,
it is output in unigram form. If you always want to output both unigrams and
bigrams, set the output_unigrams flag to true. This can be used for a
combined unigram+bigram approach.
Bigrams are generated for characters in han, hiragana, katakana and
hangul, but bigrams can be disabled for particular scripts with the
ignored_scripts parameter. All non-CJK input is passed through unmodified.
{
"index" : {
"analysis" : {
"analyzer" : {
"han_bigrams" : {
"tokenizer" : "standard",
"filter" : ["han_bigrams_filter"]
}
},
"filter" : {
"han_bigrams_filter" : {
"type" : "cjk_bigram",
"ignored_scripts": [
"hiragana",
"katakana",
"hangul"
],
"output_unigrams" : true
}
}
}
}
}
125.34. Delimited Payload Token Filter
Named delimited_payload_filter. Splits each token into a token part and a payload part whenever a delimiter character is found.
Example: "the|1 quick|2 fox|3" is split by default into tokens the, quick, and fox with payloads 1, 2, and 3 respectively.
Parameters:
delimiter-
Character used for splitting the tokens. Default is |.
encoding-
The type of the payload: int for integer, float for float and identity for characters. Default is float.
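For instance, a filter that splits on + and stores integer payloads could be configured as follows (the filter and analyzer names are invented for the example):

```json
{
  "index" : {
    "analysis" : {
      "filter" : {
        "plus_payloads" : {
          "type" : "delimited_payload_filter",
          "delimiter" : "+",
          "encoding" : "int"
        }
      },
      "analyzer" : {
        "payload_example" : {
          "tokenizer" : "whitespace",
          "filter" : [ "plus_payloads" ]
        }
      }
    }
  }
}
```

With this configuration, the input "the+1 quick+2" produces the tokens the and quick with integer payloads 1 and 2.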
125.35. Keep Words Token Filter
A token filter of type keep that only keeps tokens with text contained in a
predefined set of words. The set of words can be defined in the settings or
loaded from a text file containing one word per line.
Options
| keep_words | A list of words to keep. |
| keep_words_path | A path to a words file. |
| keep_words_case | A boolean indicating whether to lower case the words (defaults to false). |
Settings example
{
"index" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "words_till_three"]
},
"my_analyzer1" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "words_on_file"]
}
},
"filter" : {
"words_till_three" : {
"type" : "keep",
"keep_words" : [ "one", "two", "three"]
},
"words_on_file" : {
"type" : "keep",
"keep_words_path" : "/path/to/word/file"
}
}
}
}
}
125.36. Keep Types Token Filter
A token filter of type keep_types that only keeps tokens with a token type
contained in a predefined set.
Options
| types | A list of types to keep. |
Settings example
{
"index" : {
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "extract_numbers"]
}
},
"filter" : {
"extract_numbers" : {
"type" : "keep_types",
"types" : [ "<NUM>" ]
}
}
}
}
}
125.37. Classic Token Filter
The classic token filter does optional post-processing of
terms that are generated by the classic tokenizer.
This filter removes the English possessive from the end of words, and it removes dots from acronyms.
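A minimal sketch pairing the classic tokenizer with the classic filter (the analyzer name is illustrative):

```json
{
  "index" : {
    "analysis" : {
      "analyzer" : {
        "classic_example" : {
          "tokenizer" : "classic",
          "filter" : [ "classic", "lowercase" ]
        }
      }
    }
  }
}
```

With this configuration, text such as "David's I.B.M." would be indexed as the tokens david and ibm.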
126. Character Filters
Character filters are used to preprocess the string of
characters before it is passed to the tokenizer.
A character filter may be used to strip out HTML markup, or to convert
"&" characters to the word "and".
Elasticsearch has built-in character filters which can be used to build custom analyzers.
126.1. Mapping Char Filter
A char filter of type mapping that replaces characters of the analyzed text
according to the given mappings.
| mappings | A list of mappings to use. |
| mappings_path | A path, relative to the config location, to a file containing the mappings. |
Here is a sample configuration:
{
"index" : {
"analysis" : {
"char_filter" : {
"my_mapping" : {
"type" : "mapping",
"mappings" : [
"ph => f",
"qu => k"
]
}
},
"analyzer" : {
"custom_with_char_filter" : {
"tokenizer" : "standard",
"char_filter" : ["my_mapping"]
}
}
}
}
}
126.2. HTML Strip Char Filter
A char filter of type html_strip that strips HTML elements out of the
analyzed text.
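The filter takes no settings; a sketch of an analyzer that strips markup before tokenizing (the analyzer name is illustrative):

```json
{
  "index" : {
    "analysis" : {
      "analyzer" : {
        "html_example" : {
          "tokenizer" : "standard",
          "char_filter" : [ "html_strip" ]
        }
      }
    }
  }
}
```

Input such as <p>some <b>text</b></p> is reduced to some text before tokenization.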
126.3. Pattern Replace Char Filter
The pattern_replace char filter allows the use of a regex to
manipulate the characters in a string before analysis. The regular
expression is defined using the pattern parameter, and the replacement
string can be provided using the replacement parameter (supporting
referencing the original text, as explained
here).
For more information check the Lucene documentation.
Here is a sample configuration:
{
"index" : {
"analysis" : {
"char_filter" : {
"my_pattern":{
"type":"pattern_replace",
"pattern":"sample(.*)",
"replacement":"replacedSample $1"
}
},
"analyzer" : {
"custom_with_char_filter" : {
"tokenizer" : "standard",
"char_filter" : ["my_pattern"]
}
}
}
}
}
Modules
This section contains modules responsible for various aspects of the functionality in Elasticsearch. Each module has settings which may be:
- static
-
These settings must be set at the node level, either in the
elasticsearch.yml file, or as an environment variable or on the command line when starting a node. They must be set on every relevant node in the cluster. - dynamic
-
These settings can be dynamically updated on a live cluster with the cluster-update-settings API.
The modules in this section are:
- Cluster-level routing and shard allocation
-
Settings to control where, when, and how shards are allocated to nodes.
- Discovery
-
How nodes discover each other to form a cluster.
- Gateway
-
How many nodes need to join the cluster before recovery can start.
- HTTP
-
Settings to control the HTTP REST interface.
- Indices
-
Global index-related settings.
- Network
-
Controls default network settings.
- Node client
-
A Java node client joins the cluster, but doesn’t hold data or act as a master node.
- Plugins
-
Using plugins to extend Elasticsearch.
- Scripting
-
Custom scripting available in Lucene Expressions, Groovy, Python, and Javascript.
- Snapshot/Restore
-
Backup your data with snapshot/restore.
- Thread pools
-
Information about the dedicated thread pools used in Elasticsearch.
- Transport
-
Configure the transport networking layer, used internally by Elasticsearch to communicate between nodes.
- Tribe nodes
-
A tribe node joins one or more clusters and acts as a federated client across them.
127. Cluster
One of the main roles of the master is to decide which shards to allocate to which nodes, and when to move shards between nodes in order to rebalance the cluster.
There are a number of settings available to control the shard allocation process:
-
Cluster Level Shard Allocation lists the settings to control allocation and rebalancing operations.
-
Disk-based Shard Allocation explains how Elasticsearch takes available disk space into account, and the related settings.
-
Shard Allocation Awareness and Forced Awareness control how shards can be distributed across different racks or availability zones.
-
Shard Allocation Filtering allows certain nodes or groups of nodes to be excluded from allocation so that they can be decommissioned.
Besides these, there are a few other miscellaneous cluster-level settings.
All of the settings in this section are dynamic settings which can be updated on a live cluster with the cluster-update-settings API.
127.1. Cluster Level Shard Allocation
Shard allocation is the process of allocating shards to nodes. This can happen during initial recovery, replica allocation, rebalancing, or when nodes are added or removed.
Shard Allocation Settings
The following dynamic settings may be used to control shard allocation and recovery:
cluster.routing.allocation.enable-
Enable or disable allocation for specific kinds of shards:
-
all- (default) Allows shard allocation for all kinds of shards. -
primaries- Allows shard allocation only for primary shards. -
new_primaries- Allows shard allocation only for primary shards for new indices. -
none- No shard allocations of any kind are allowed for any indices.
This setting does not affect the recovery of local primary shards when restarting a node. A restarted node that has a copy of an unassigned primary shard will recover that primary immediately, assuming that the index.recovery.initial_shards setting is satisfied.
cluster.routing.allocation.node_concurrent_recoveries-
How many concurrent shard recoveries are allowed to happen on a node. Defaults to 2.
cluster.routing.allocation.node_initial_primaries_recoveries-
While the recovery of replicas happens over the network, the recovery of an unassigned primary after node restart uses data from the local disk. These should be fast so more initial primary recoveries can happen in parallel on the same node. Defaults to 4.
cluster.routing.allocation.same_shard.host-
Allows a check to prevent allocation of multiple instances of the same shard on a single host, based on host name and host address. Defaults to false, meaning that no check is performed by default. This setting only applies if multiple nodes are started on the same machine.
indices.recovery.concurrent_streams-
The number of network streams to open per node to recover a shard from a peer shard. Defaults to 3.
indices.recovery.concurrent_small_file_streams-
The number of streams to open per node for small files (under 5mb) to recover a shard from a peer shard. Defaults to 2.
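For example, before performing cluster maintenance, allocation can be disabled dynamically with the cluster-update-settings API:

```json
PUT /_cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.enable" : "none"
  }
}
```

Setting the value back to all re-enables allocation once maintenance is complete.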
Shard Rebalancing Settings
The following dynamic settings may be used to control the rebalancing of shards across the cluster:
cluster.routing.rebalance.enable-
Enable or disable rebalancing for specific kinds of shards:
-
all- (default) Allows shard balancing for all kinds of shards. -
primaries- Allows shard balancing only for primary shards. -
replicas- Allows shard balancing only for replica shards. -
none- No shard balancing of any kind is allowed for any indices.
-
cluster.routing.allocation.allow_rebalance-
Specify when shard rebalancing is allowed:
-
always- Always allow rebalancing. -
indices_primaries_active- Only when all primaries in the cluster are allocated. -
indices_all_active- (default) Only when all shards (primaries and replicas) in the cluster are allocated.
-
cluster.routing.allocation.cluster_concurrent_rebalance-
Controls how many concurrent shard rebalances are allowed cluster wide. Defaults to 2.
Shard Balancing Heuristics
The following settings are used together to determine where to place each
shard. The cluster is balanced when no allowed action can bring the weights
of each node closer together by more than the balance.threshold.
cluster.routing.allocation.balance.shard-
Defines the weight factor for shards allocated on a node (float). Defaults to
0.45f. Raising this raises the tendency to equalize the number of shards across all nodes in the cluster. cluster.routing.allocation.balance.index-
Defines a factor to the number of shards per index allocated on a specific node (float). Defaults to
0.55f. Raising this raises the tendency to equalize the number of shards per index across all nodes in the cluster. cluster.routing.allocation.balance.threshold-
Minimal optimization value of operations that should be performed (non negative float). Defaults to
1.0f. Raising this will cause the cluster to be less aggressive about optimizing the shard balance.
Note: Regardless of the result of the balancing algorithm, rebalancing might not be allowed due to forced awareness or allocation filtering.
127.2. Disk-based Shard Allocation
Elasticsearch factors in the available disk space on a node before deciding whether to allocate new shards to that node or to actively relocate shards away from that node.
Below are the settings that can be configured in the elasticsearch.yml config
file or updated dynamically on a live cluster with the
cluster-update-settings API:
cluster.routing.allocation.disk.threshold_enabled-
Defaults to true. Set to false to disable the disk allocation decider.
cluster.routing.allocation.disk.watermark.low-
Controls the low watermark for disk usage. It defaults to 85%, meaning ES will not allocate new shards to nodes once they have more than 85% disk used. It can also be set to an absolute byte value (like 500mb) to prevent ES from allocating shards if less than the configured amount of space is available.
cluster.routing.allocation.disk.watermark.high-
Controls the high watermark. It defaults to 90%, meaning ES will attempt to relocate shards to another node if the node disk usage rises above 90%. It can also be set to an absolute byte value (similar to the low watermark) to relocate shards once less than the configured amount of space is available on the node.
Note: Percentage values refer to used disk space, while byte values refer to free disk space. This can be confusing, since it flips the meaning of high and low. For example, it makes sense to set the low watermark to 10gb and the high watermark to 5gb, but not the other way around.
cluster.info.update.interval-
How often Elasticsearch should check on disk usage for each node in the cluster. Defaults to 30s.
cluster.routing.allocation.disk.include_relocations-
Defaults to true, which means that Elasticsearch will take into account shards that are currently being relocated to the target node when computing a node’s disk usage. Taking relocating shards' sizes into account may, however, mean that the disk usage for a node is incorrectly estimated on the high side, since the relocation could be 90% complete and a recently retrieved disk usage would include the total size of the relocating shard as well as the space already used by the running relocation.
An example of updating the low watermark to no more than 80% of the disk size, a high watermark of at least 50 gigabytes free, and updating the information about the cluster every minute:
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "80%",
"cluster.routing.allocation.disk.watermark.high": "50gb",
"cluster.info.update.interval": "1m"
}
}
Note: Prior to 2.0.0, when using multiple data paths, the disk threshold decider only factored in the usage across all data paths (if you had two data paths, one with 50b out of 100b free (50% used) and another with 40b out of 50b free (80% used) it would see the node’s disk usage as 90b out of 150b). In 2.0.0, the minimum and maximum disk usages are tracked separately.
127.3. Shard Allocation Awareness
When running nodes on multiple VMs on the same physical server, on multiple racks, or across multiple awareness zones, it is more likely that two nodes on the same physical server, in the same rack, or in the same awareness zone will crash at the same time, rather than two unrelated nodes crashing simultaneously.
If Elasticsearch is aware of the physical configuration of your hardware, it can ensure that the primary shard and its replica shards are spread across different physical servers, racks, or zones, to minimise the risk of losing all shard copies at the same time.
The shard allocation awareness settings allow you to tell Elasticsearch about your hardware configuration.
As an example, let’s assume we have several racks. When we start a node, we
can tell it which rack it is in by assigning it an arbitrary metadata
attribute called rack_id — we could use any attribute name. For example:
./bin/elasticsearch --node.rack_id rack_one
This setting could also be specified in the elasticsearch.yml config file.
Now, we need to set up shard allocation awareness by telling Elasticsearch
which attributes to use. This can be configured in the elasticsearch.yml
file on all master-eligible nodes, or it can be set (and changed) with the
cluster-update-settings API.
For our example, we’ll set the value in the config file:
cluster.routing.allocation.awareness.attributes: rack_id
With this config in place, let’s say we start two nodes with node.rack_id
set to rack_one, and we create an index with 5 primary shards and 1 replica
of each primary. All primaries and replicas are allocated across the two
nodes.
Now, if we start two more nodes with node.rack_id set to rack_two,
Elasticsearch will move shards across to the new nodes, ensuring (if possible)
that no two copies of the same shard will be in the same rack. However if rack_two
were to fail, taking down both of its nodes, Elasticsearch will still allocate the lost
shard copies to nodes in rack_one.
Multiple awareness attributes can be specified, in which case the combination of values from each attribute is considered to be a separate value.
cluster.routing.allocation.awareness.attributes: rack_id,zone
Note: When using awareness attributes, shards will not be allocated to nodes that don’t have values set for those attributes.
Note: The number of copies of a shard allocated to a specific group of nodes with the same awareness attribute value is determined by the number of attribute values. When the number of nodes in each group is unbalanced and there are many replicas, replica shards may be left unassigned.
Forced Awareness
Imagine that you have two awareness zones and enough hardware across the two zones to host all of your primary and replica shards. But perhaps the hardware in a single zone, while sufficient to host half the shards, would be unable to host ALL the shards.
With ordinary awareness, if one zone lost contact with the other zone, Elasticsearch would assign all of the missing replica shards to a single zone. But in this example, this sudden extra load would cause the hardware in the remaining zone to be overloaded.
Forced awareness solves this problem by NEVER allowing copies of the same shard to be allocated to the same zone.
For example, let’s say we have an awareness attribute called zone, and
we know we are going to have two zones, zone1 and zone2. Here is how
we can force awareness on a node:
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
cluster.routing.allocation.awareness.attributes: zone
We must list all possible values that the zone attribute can have.
Now, if we start 2 nodes with node.zone set to zone1 and create an index
with 5 shards and 1 replica, the index will be created, but only the 5 primary
shards will be allocated (with no replicas). Only when we start more nodes
with node.zone set to zone2 will the replicas be allocated.
The cluster.routing.allocation.awareness.* settings can all be updated
dynamically on a live cluster with the
cluster-update-settings API.
127.4. Shard Allocation Filtering
While Index Shard Allocation provides per-index settings to control the allocation of shards to nodes, cluster-level shard allocation filtering allows you to allow or disallow the allocation of shards from any index to particular nodes.
The typical use case for cluster-wide shard allocation filtering is when you want to decommission a node, and you would like to move the shards from that node to other nodes in the cluster before shutting it down.
For instance, we could decommission a node using its IP address as follows:
PUT /_cluster/settings
{
"transient" : {
"cluster.routing.allocation.exclude._ip" : "10.0.0.1"
}
}
Note: Shards will only be relocated if it is possible to do so without breaking another routing constraint, such as never allocating a primary and replica shard to the same node.
Cluster-wide shard allocation filtering works in the same way as index-level shard allocation filtering (see Index Shard Allocation for details).
The available dynamic cluster settings are as follows, where {attribute}
refers to an arbitrary node attribute:
cluster.routing.allocation.include.{attribute}-
Assign the index to a node whose {attribute} has at least one of the comma-separated values.
cluster.routing.allocation.require.{attribute}-
Assign the index to a node whose {attribute} has all of the comma-separated values.
cluster.routing.allocation.exclude.{attribute}-
Assign the index to a node whose {attribute} has none of the comma-separated values.
These special attributes are also supported:
| _name | Match nodes by node name |
| _ip | Match nodes by IP address (the IP address associated with the hostname) |
| _host | Match nodes by hostname |
All attribute values can be specified with wildcards, e.g.:
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.include._ip": "192.168.2.*"
}
}
127.5. Miscellaneous cluster settings
127.5.1. Metadata
An entire cluster may be set to read-only with the following dynamic setting:
cluster.blocks.read_only-
Make the whole cluster read only (indices do not accept write operations), metadata is not allowed to be modified (create or delete indices).
Note: Don’t rely on this setting to prevent changes to your cluster. Any user with access to the cluster-update-settings API can make the cluster read-write again.
128. Discovery
The discovery module is responsible for discovering nodes within a cluster, as well as electing a master node.
Note that Elasticsearch is a peer to peer based system, in which nodes communicate with one another directly when operations are delegated or broadcast. All the main APIs (index, delete, search) do not communicate with the master node. The responsibility of the master node is to maintain the global cluster state, and to act when nodes join or leave the cluster by reassigning shards. Each time the cluster state changes, the new state is made known to the other nodes in the cluster (the manner depends on the actual discovery implementation).
Settings
The cluster.name setting allows clusters to be separated from one another.
The default value for the cluster name is elasticsearch, though it is
recommended to change this to reflect the logical group name of the
running cluster.
128.1. Azure Discovery
Azure discovery uses the Azure APIs to perform automatic discovery (similar to multicast). It is available as a plugin. See cloud-azure for more information.
128.2. EC2 Discovery
EC2 discovery is available as a plugin. See cloud-aws for more information.
128.3. Google Compute Engine Discovery
Google Compute Engine (GCE) discovery uses the GCE APIs to perform automatic discovery (similar to multicast). It is available as a plugin. See cloud-gce for more information.
128.4. Zen Discovery
Zen discovery is the built-in, default discovery module for Elasticsearch. It provides unicast discovery, but can be extended to support cloud environments and other forms of discovery.
Zen discovery is integrated with other modules; for example, all communication between nodes is done using the transport module.
It is separated into several sub-modules, which are explained below:
Ping
This is the process where a node uses the discovery mechanisms to find other nodes.
Unicast
The unicast discovery requires a list of hosts to use that will act
as gossip routers. It provides the following settings with the
discovery.zen.ping.unicast prefix:
| Setting | Description |
|---|---|
| hosts | Either an array setting or a comma delimited setting. Each value should be in the form of host:port or host (where port defaults to the configured transport port). |
The unicast discovery uses the transport module to perform the discovery.
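For example, a unicast host list could be configured in elasticsearch.yml like this (the host names are illustrative):

```yaml
discovery.zen.ping.unicast.hosts: ["host1", "host2:9300"]
```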
Master Election
As part of the ping process a master of the cluster is either
elected or joined to. This is done automatically. The
discovery.zen.ping_timeout (which defaults to 3s) allows for the
tweaking of election time to handle cases of slow or congested networks
(higher values assure less chance of failure). Once a node joins, it
will send a join request to the master (discovery.zen.join_timeout)
with a timeout defaulting to 20 times the ping timeout.
When the master node stops or has encountered a problem, the cluster nodes start pinging again and will elect a new master. This pinging round also serves as a protection against (partial) network failures where a node may unjustly think that the master has failed. In this case the node will simply hear from other nodes about the currently active master.
If discovery.zen.master_election.filter_client is true, pings from client nodes (nodes where node.client is
true, or both node.data and node.master are false) are ignored during master election; the default value is
true. If discovery.zen.master_election.filter_data is true, pings from non-master-eligible data nodes (nodes
where node.data is true and node.master is false) are ignored during master election; the default value is
false. Pings from master-eligible nodes are always observed during master election.
Nodes can be excluded from becoming a master by setting node.master to
false. Note, once a node is a client node (node.client set to
true), it will not be allowed to become a master (node.master is
automatically set to false).
The discovery.zen.minimum_master_nodes setting sets the minimum
number of master eligible nodes that need to join a newly elected master in order for an election to
complete and for the elected node to accept its mastership. The same setting controls the minimum number of
active master eligible nodes that should be a part of any active cluster. If this requirement is not met the
active master node will step down and a new master election will begin.
This setting must be set to a quorum of your master eligible nodes. It is recommended to avoid having only two master eligible nodes, since a quorum of two is two. Therefore, a loss of either master node will result in an inoperable cluster.
Fault Detection
There are two fault detection processes running. The first is run by the master, which pings all the other nodes in the cluster to verify that they are alive. On the other end, each node pings the master to verify that it is still alive, or whether an election process needs to be initiated.
The following settings control the fault detection process using the
discovery.zen.fd prefix:
| Setting | Description |
|---|---|
| fd.ping_interval | How often a node gets pinged. Defaults to 1s. |
| fd.ping_timeout | How long to wait for a ping response. Defaults to 30s. |
| fd.ping_retries | How many ping failures / timeouts cause a node to be considered failed. Defaults to 3. |
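The retry logic can be sketched as follows (a simplified model; the function name and the consecutive-failure interpretation are assumptions for illustration, not Elasticsearch's actual code):

```python
def node_failed(ping_results, ping_retries=3):
    """Simplified fault-detection sketch: a node is considered failed
    once ping_retries consecutive pings have failed or timed out.
    ping_results is a sequence of booleans (True = ping succeeded)."""
    consecutive_failures = 0
    for ok in ping_results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= ping_retries:
            return True
    return False

# A successful ping resets the failure count, so this node survives:
print(node_failed([True, False, False, True, False]))  # False
# Three failures in a row marks the node as failed:
print(node_failed([False, False, False]))              # True
```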
Cluster state updates
The master node is the only node in a cluster that can make changes to the
cluster state. The master node processes one cluster state update at a time,
applies the required changes and publishes the updated cluster state to all
the other nodes in the cluster. Each node receives the publish message,
updates its own cluster state and replies to the master node, which waits for
all nodes to respond, up to a timeout, before going ahead with processing the next
update in the queue. The discovery.zen.publish_timeout is set by default
to 30 seconds and can be changed dynamically through the
cluster update settings API.
No master block
For the cluster to be fully operational, it must have an active master and the
number of running master eligible nodes must satisfy the
discovery.zen.minimum_master_nodes setting if set. The
discovery.zen.no_master_block settings controls what operations should be
rejected when there is no active master.
The discovery.zen.no_master_block setting has two valid options:
all
|
All operations on the node (both reads and writes) will be rejected. This also applies to cluster state read and write API operations, like the get index settings, put mapping and cluster state APIs. |
write
|
(default) Write operations will be rejected. Read operations will succeed, based on the last known cluster configuration. This may result in partial reads of stale data as this node may be isolated from the rest of the cluster. |
The discovery.zen.no_master_block setting doesn’t apply to node-level APIs (for example the cluster stats, node info and
node stats APIs), which will not be blocked and will try to execute on any possible node.
129. Local Gateway
The local gateway module stores the cluster state and shard data across full cluster restarts.
The following static settings, which must be set on every data node in the cluster, control how long nodes should wait before they try to recover any shards which are stored locally:
gateway.expected_nodes-
The number of (data or master) nodes that are expected to be in the cluster. Recovery of local shards will start as soon as the expected number of nodes have joined the cluster. Defaults to 0.
gateway.expected_master_nodes-
The number of master nodes that are expected to be in the cluster. Recovery of local shards will start as soon as the expected number of master nodes have joined the cluster. Defaults to 0.
gateway.expected_data_nodes-
The number of data nodes that are expected to be in the cluster. Recovery of local shards will start as soon as the expected number of data nodes have joined the cluster. Defaults to 0.
gateway.recover_after_time-
If the expected number of nodes is not achieved, the recovery process waits for the configured amount of time before trying to recover regardless. Defaults to 5m if one of the expected_nodes settings is configured.
Once the recover_after_time duration has timed out, recovery will start
as long as the following conditions are met:
gateway.recover_after_nodes-
Recover as long as this many data or master nodes have joined the cluster.
gateway.recover_after_master_nodes-
Recover as long as this many master nodes have joined the cluster.
gateway.recover_after_data_nodes-
Recover as long as this many data nodes have joined the cluster.
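The interplay between the `expected_*` and `recover_after_*` settings can be sketched as a simple decision function (an illustrative model under stated assumptions, not Elasticsearch's actual implementation; the function and parameter names are hypothetical):

```python
def should_start_recovery(joined_nodes, expected_nodes,
                          recover_after_nodes, recover_after_time_elapsed):
    """Simplified local-gateway recovery decision:
    - recover immediately once the expected number of nodes has joined;
    - otherwise, after recover_after_time has elapsed, recover as long as
      the recover_after_nodes threshold is met."""
    if expected_nodes and joined_nodes >= expected_nodes:
        return True
    if recover_after_time_elapsed and joined_nodes >= recover_after_nodes:
        return True
    return False

# All expected nodes present: recovery starts right away.
print(should_start_recovery(5, expected_nodes=5, recover_after_nodes=3,
                            recover_after_time_elapsed=False))  # True
# Timer expired with enough nodes for the lower threshold.
print(should_start_recovery(3, expected_nodes=5, recover_after_nodes=3,
                            recover_after_time_elapsed=True))   # True
# Timer still running and not all expected nodes yet: keep waiting.
print(should_start_recovery(3, expected_nodes=5, recover_after_nodes=3,
                            recover_after_time_elapsed=False))  # False
```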
Note: These settings only take effect on a full cluster restart.
130. HTTP
The http module exposes the Elasticsearch APIs over HTTP.
The http mechanism is completely asynchronous in nature, meaning that there is no blocking thread waiting for a response. Using asynchronous communication for HTTP helps solve the C10k problem.
When possible, consider using HTTP keep-alive when connecting, for better performance, and try to configure your client not to use HTTP chunking.
Settings
The settings in the table below can be configured for HTTP. Note that none of
them are dynamically updatable so for them to take effect they should be set in
elasticsearch.yml.
| Setting | Description |
|---|---|
| http.port | A bind port range. Defaults to 9200-9300. |
| http.publish_port | The port that HTTP clients should use when communicating with this node. Useful when a cluster node is behind a proxy or firewall and the http.port is not directly addressable from the outside. Defaults to the actual port assigned via http.port. |
| http.bind_host | The host address to bind the HTTP service to. Defaults to http.host (if set) or network.bind_host. |
| http.publish_host | The host address to publish for HTTP clients to connect to. Defaults to http.host (if set) or network.publish_host. |
| http.host | Used to set the http.bind_host and the http.publish_host. Defaults to network.host. |
| http.max_content_length | The max content of an HTTP request. Defaults to 100mb. |
| http.max_initial_line_length | The max length of an HTTP URL. Defaults to 4kb. |
| http.max_header_size | The max size of allowed headers. Defaults to 8kb. |
| http.compression | Support for compression when possible (with Accept-Encoding). Defaults to false. |
| http.compression_level | Defines the compression level to use. Defaults to 6. |
| http.cors.enabled | Enable or disable cross-origin resource sharing, i.e. whether a browser on another origin can do requests to Elasticsearch. Defaults to false. |
| http.cors.allow-origin | Which origins to allow. Defaults to no origins allowed. If you prepend and append a / to the value, this will be treated as a regular expression. |
| http.cors.max-age | Browsers send a "preflight" OPTIONS-request to determine CORS settings; max-age defines how long the result should be cached for. Defaults to 1728000 (20 days). |
| http.cors.allow-methods | Which methods to allow. Defaults to OPTIONS, HEAD, GET, POST, PUT, DELETE. |
| http.cors.allow-headers | Which headers to allow. Defaults to X-Requested-With, Content-Type, Content-Length. |
| http.cors.allow-credentials | Whether the Access-Control-Allow-Credentials header should be returned. Defaults to false. |
| http.detailed_errors.enabled | Enables or disables the output of detailed error messages and stack traces in response output. Defaults to true. |
| http.pipelining | Enable or disable HTTP pipelining. Defaults to true. |
| http.pipelining.max_events | The maximum number of events to be queued up in memory before an HTTP connection is closed. Defaults to 10000. |
It also uses the common network settings.
Disable HTTP
The http module can be completely disabled and not started by setting
http.enabled to false. Elasticsearch nodes (and Java clients) communicate
internally using the transport interface, not HTTP. It
might make sense to disable the http layer entirely on nodes which are not
meant to serve REST requests directly. For instance, you could disable HTTP on
data-only nodes if you also have
client nodes which are intended to serve all REST requests.
Be aware, however, that you will not be able to send any REST requests (e.g. to
retrieve node stats) directly to nodes which have HTTP disabled.
131. Indices
The indices module controls index-related settings that are globally managed for all indices, rather than being configurable at a per-index level.
Available settings include:
- Circuit breaker
-
Circuit breakers set limits on memory usage to avoid out of memory exceptions.
- Fielddata cache
-
Set limits on the amount of heap used by the in-memory fielddata cache.
- Node query cache
-
Configure the amount of heap used to cache query results.
- Indexing buffer
-
Control the size of the buffer allocated to the indexing process.
- Shard request cache
-
Control the behaviour of the shard-level request cache.
- Recovery
-
Control the resource limits on the shard recovery process.
- TTL interval
-
Control how expired documents are removed.
131.1. Circuit Breaker
Elasticsearch contains multiple circuit breakers used to prevent operations from causing an OutOfMemoryError. Each breaker specifies a limit for how much memory it can use. Additionally, there is a parent-level breaker that specifies the total amount of memory that can be used across all breakers.
These settings can be dynamically updated on a live cluster with the cluster-update-settings API.
Parent circuit breaker
The parent-level breaker can be configured with the following setting:
indices.breaker.total.limit-
Starting limit for overall parent breaker, defaults to 70% of JVM heap.
Field data circuit breaker
The field data circuit breaker allows Elasticsearch to estimate the amount of memory a field will require to be loaded into memory. It can then prevent the field data loading by raising an exception. By default the limit is configured to 60% of the maximum JVM heap. It can be configured with the following parameters:
indices.breaker.fielddata.limit-
Limit for fielddata breaker, defaults to 60% of JVM heap
indices.breaker.fielddata.overhead-
A constant that all field data estimations are multiplied with to determine a final estimation. Defaults to 1.03
Request circuit breaker
The request circuit breaker allows Elasticsearch to prevent per-request data structures (for example, memory used for calculating aggregations during a request) from exceeding a certain amount of memory.
indices.breaker.request.limit-
Limit for request breaker, defaults to 40% of JVM heap
indices.breaker.request.overhead-
A constant that all request estimations are multiplied with to determine a final estimation. Defaults to 1
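The overhead constant acts as a safety margin on the raw memory estimate; the check can be sketched as follows (illustrative only; the function name is an assumption, not the actual breaker implementation):

```python
def breaker_would_trip(estimated_bytes, heap_bytes,
                       limit_fraction=0.6, overhead=1.03):
    """Sketch of a fielddata-style circuit breaker check: the raw estimate
    is multiplied by the overhead constant and compared against the limit,
    expressed as a fraction of the JVM heap (60% for the fielddata breaker)."""
    adjusted = estimated_bytes * overhead
    return adjusted > heap_bytes * limit_fraction

heap = 1024 * 1024 * 1024  # assume a 1 GB heap
print(breaker_would_trip(500 * 1024 * 1024, heap))  # False: ~515 MB < ~614 MB limit
print(breaker_would_trip(700 * 1024 * 1024, heap))  # True: ~721 MB > ~614 MB limit
```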
131.2. Fielddata
The field data cache is used mainly when sorting on or computing aggregations on a field. It loads all the field values into memory in order to provide fast document-based access to those values. The field data cache can be expensive to build for a field, so it's recommended to have enough memory to allocate it, and to keep it loaded.
The amount of memory used for the field
data cache can be controlled using indices.fielddata.cache.size. Note:
reloading the field data which does not fit into your cache will be expensive
and perform poorly.
indices.fielddata.cache.size-
The max size of the field data cache, e.g. 30% of node heap space, or an absolute value, e.g. 12GB. Defaults to unbounded. Also see Field data circuit breaker.
Note: These are static settings which must be configured on every data node in the cluster.
Monitoring field data
You can monitor memory usage for field data as well as the field data circuit breaker using the Nodes Stats API.
131.3. Node Query Cache
The query cache is responsible for caching the results of queries. There is one query cache per node that is shared by all shards. The cache implements an LRU eviction policy: when the cache becomes full, the least recently used data is evicted to make way for new data.
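The LRU eviction policy can be illustrated with a toy Python cache (an illustrative model only; this class is not part of Elasticsearch):

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: when full, the least recently used entry is evicted."""
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.max_entries:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("q1", "r1")
cache.put("q2", "r2")
cache.get("q1")          # q1 is now the most recently used entry
cache.put("q3", "r3")    # cache is full: evicts q2, the least recently used
print(cache.get("q2"))   # None
print(cache.get("q1"))   # r1
```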
The query cache only caches queries which are being used in a filter context.
The following setting is static and must be configured on every data node in the cluster:
indices.queries.cache.size-
Controls the memory size for the filter cache, defaults to 10%. Accepts either a percentage value, like 5%, or an exact value, like 512mb.
131.4. Indexing Buffer
The indexing buffer is used to store newly indexed documents. When it fills up, the documents in the buffer are written to a segment on disk. It is divided between all shards on the node.
The following settings are static and must be configured on every data node in the cluster:
indices.memory.index_buffer_size-
Accepts either a percentage or a byte size value. It defaults to 10%, meaning that 10% of the total heap allocated to a node will be used as the indexing buffer size.
indices.memory.min_index_buffer_size-
If the index_buffer_size is specified as a percentage, then this setting can be used to specify an absolute minimum. Defaults to 48mb.
indices.memory.max_index_buffer_size-
If the index_buffer_size is specified as a percentage, then this setting can be used to specify an absolute maximum. Defaults to unbounded.
indices.memory.min_shard_index_buffer_size-
Sets a hard lower limit for the memory allocated per shard for its own indexing buffer. Defaults to 4mb.
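When index_buffer_size is given as a percentage, the min/max settings act as clamps on the computed value; a sketch of that arithmetic (function and parameter names are illustrative, not Elasticsearch code):

```python
def indexing_buffer_bytes(heap_bytes, percent=10.0,
                          min_bytes=48 * 1024 * 1024, max_bytes=None):
    """Sketch: compute the indexing buffer from a heap percentage,
    clamped by the min/max settings described above."""
    size = heap_bytes * percent / 100.0
    size = max(size, min_bytes)           # apply min_index_buffer_size
    if max_bytes is not None:
        size = min(size, max_bytes)       # apply max_index_buffer_size
    return int(size)

gb = 1024 ** 3
print(indexing_buffer_bytes(4 * gb))             # 10% of a 4 GB heap (~429 MB)
print(indexing_buffer_bytes(256 * 1024 * 1024))  # clamped up to the 48 MB minimum
```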
131.5. Shard request cache
When a search request is run against an index or against many indices, each involved shard executes the search locally and returns its local results to the coordinating node, which combines these shard-level results into a “global” result set.
The shard-level request cache module caches the local results on each shard. This allows frequently used (and potentially heavy) search requests to return results almost instantly. The requests cache is a very good fit for the logging use case, where only the most recent index is being actively updated — results from older indices will be served directly from the cache.
Note: For now, the requests cache will only cache the results of search requests where size=0, so it will not cache hits, but it will cache hits.total, aggregations, and suggestions. Queries that use now (see Date Math) cannot be cached.
Cache invalidation
The cache is smart — it keeps the same near real-time promise as uncached search.
Cached results are invalidated automatically whenever the shard refreshes, but only if the data in the shard has actually changed. In other words, you will always get the same results from the cache as you would for an uncached search request.
The longer the refresh interval, the longer that cached entries will remain valid. If the cache is full, the least recently used cache keys will be evicted.
The cache can be expired manually with the clear-cache API:
curl -XPOST 'localhost:9200/kimchy,elasticsearch/_cache/clear?request_cache=true'
Enabling caching by default
The cache is not enabled by default, but can be enabled when creating a new index as follows:
curl -XPUT localhost:9200/my_index -d'
{
"settings": {
"index.requests.cache.enable": true
}
}
'
It can also be enabled or disabled dynamically on an existing index with the
update-settings API:
curl -XPUT localhost:9200/my_index/_settings -d'
{ "index.requests.cache.enable": true }
'
Enabling caching per request
The request_cache query-string parameter can be used to enable or disable
caching on a per-request basis. If set, it overrides the index-level setting:
curl 'localhost:9200/my_index/_search?request_cache=true' -d'
{
"size": 0,
"aggs": {
"popular_colors": {
"terms": {
"field": "colors"
}
}
}
}
'
Important: If your query uses a script whose result is not deterministic (e.g. it uses a random function or references the current time) you should set the request_cache flag to false to disable caching for that request.
Cache key
The whole JSON body is used as the cache key. This means that if the JSON changes — for instance if keys are output in a different order — then the cache key will not be recognised.
Tip: Most JSON libraries support a canonical mode which ensures that JSON keys are always emitted in the same order. This canonical mode can be used in the application to ensure that a request is always serialized in the same way.
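In Python, for example, json.dumps with sort_keys=True produces such a canonical serialization:

```python
import json

# Two logically identical requests whose keys are ordered differently:
a = {"size": 0, "aggs": {"popular_colors": {"terms": {"field": "colors"}}}}
b = {"aggs": {"popular_colors": {"terms": {"field": "colors"}}}, "size": 0}

# Naive serialization preserves insertion order, so the strings differ
# and the cache key would not match:
print(json.dumps(a) == json.dumps(b))  # False

# Canonical serialization (sorted keys, fixed separators) is stable:
def canonical(d):
    return json.dumps(d, sort_keys=True, separators=(",", ":"))

print(canonical(a) == canonical(b))    # True
```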
Cache settings
The cache is managed at the node level, and has a default maximum size of 1%
of the heap. This can be changed in the config/elasticsearch.yml file with:
indices.requests.cache.size: 2%
Also, you can use the indices.requests.cache.expire setting to specify a TTL
for cached results, but there should be no reason to do so. Remember that
stale results are automatically invalidated when the index is refreshed. This
setting is provided for completeness' sake only.
Monitoring cache usage
The size of the cache (in bytes) and the number of evictions can be viewed
by index, with the indices-stats API:
curl 'localhost:9200/_stats/request_cache?pretty&human'
or by node with the nodes-stats API:
curl 'localhost:9200/_nodes/stats/indices/request_cache?pretty&human'
131.6. Indices Recovery
The following expert settings can be set to manage the recovery policy.
indices.recovery.concurrent_streams-
Defaults to 3.
indices.recovery.concurrent_small_file_streams-
Defaults to 2.
indices.recovery.file_chunk_size-
Defaults to 512kb.
indices.recovery.translog_ops-
Defaults to 1000.
indices.recovery.translog_size-
Defaults to 512kb.
indices.recovery.compress-
Defaults to true.
indices.recovery.max_bytes_per_sec-
Defaults to 40mb.
These settings can be dynamically updated on a live cluster with the cluster-update-settings API.
131.7. TTL interval
Documents that have a ttl value set need to be deleted
once they have expired. How and how often they are deleted is controlled by
the following dynamic cluster settings:
indices.ttl.interval-
How often the deletion process runs. Defaults to 60s.
indices.ttl.bulk_size-
The deletions are processed with a bulk request. The number of deletions processed can be configured with this setting. Defaults to 10000.
132. Network Settings
Elasticsearch binds to localhost only by default. This is sufficient for you to run a local development server (or even a development cluster, if you start multiple nodes on the same machine), but you will need to configure some basic network settings in order to run a real production cluster across multiple servers.
Warning: Be careful with the network configuration! Never expose an unprotected node to the public internet.
Commonly Used Network Settings
network.host-
The node will bind to this hostname or IP address and publish (advertise) this host to other nodes in the cluster. Accepts an IP address, hostname, a special value, or an array of any combination of these. Defaults to _local_.
discovery.zen.ping.unicast.hosts-
In order to join a cluster, a node needs to know the hostname or IP address of at least some of the other nodes in the cluster. This setting provides the initial list of other nodes that this node will try to contact. Accepts IP addresses or hostnames. Defaults to ["127.0.0.1", "[::1]"].
http.port-
Port to bind to for incoming HTTP requests. Accepts a single value or a range. If a range is specified, the node will bind to the first available port in the range. Defaults to 9200-9300.
transport.tcp.port-
Port to bind for communication between nodes. Accepts a single value or a range. If a range is specified, the node will bind to the first available port in the range. Defaults to 9300-9400.
Special values for network.host
The following special values may be passed to network.host:
_[networkInterface]_
|
Addresses of a network interface, for example _en0_. |
_local_
|
Any loopback addresses on the system, for example 127.0.0.1. |
_site_
|
Any site-local addresses on the system, for example 192.168.0.1. |
_global_
|
Any globally-scoped addresses on the system, for example 8.8.8.8. |
IPv4 vs IPv6
These special values will work over both IPv4 and IPv6 by default, but you can
also limit this with the use of :ipv4 or :ipv6 specifiers. For example,
_en0:ipv4_ would only bind to the IPv4 addresses of interface en0.
Tip: Discovery in the cloud: more special settings are available when running in the cloud with either the AWS Cloud plugin or the Google Compute Engine Cloud plugin installed.
Advanced network settings
The network.host setting explained in Commonly used network settings
is a shortcut which sets the bind host and the publish host at the same
time. In advanced use cases, such as when running behind a proxy server, you
may need to set these settings to different values:
network.bind_host-
This specifies which network interface(s) a node should bind to in order to listen for incoming requests. A node can bind to multiple interfaces, e.g. two network cards, or a site-local address and a local address. Defaults to network.host.
network.publish_host-
The publish host is the single interface that the node advertises to other nodes in the cluster, so that those nodes can connect to it. Currently an Elasticsearch node may be bound to multiple addresses, but only publishes one. If not specified, this defaults to the “best” address from network.bind_host, sorted by IPv4/IPv6 stack preference, then by reachability.
Both of the above settings can be configured just like network.host — they
accept IP addresses, host names, and
special values.
Advanced TCP Settings
network.tcp.no_delay
|
Enable or disable the TCP no delay
setting. Defaults to true. |
network.tcp.keep_alive
|
Enable or disable TCP keep alive.
Defaults to true. |
network.tcp.reuse_address
|
Should an address be reused or not. Defaults to true on non-Windows machines. |
network.tcp.send_buffer_size
|
The size of the TCP send buffer (specified with size units). By default not explicitly set. |
network.tcp.receive_buffer_size
|
The size of the TCP receive buffer (specified with size units). By default not explicitly set. |
Transport and HTTP protocols
An Elasticsearch node exposes two network protocols which inherit the above settings, but may be further configured independently:
- TCP Transport
-
Used for communication between nodes in the cluster, by the Java Transport client and by the Tribe node. See the Transport module for more information.
- HTTP
-
Exposes the JSON-over-HTTP interface used by all clients other than the Java clients. See the HTTP module for more information.
133. Node
Any time that you start an instance of Elasticsearch, you are starting a node. A collection of connected nodes is called a cluster. If you are running a single node of Elasticsearch, then you have a cluster of one node.
Every node in the cluster can handle HTTP and
Transport traffic by default. The transport layer
is used exclusively for communication between nodes and between nodes and the
Java TransportClient; the HTTP layer is
used only by external REST clients.
All nodes know about all the other nodes in the cluster and can forward client requests to the appropriate node. Besides that, each node serves one or more purposes:
- Master-eligible node
-
A node that has node.master set to true (default), which makes it eligible to be elected as the master node, which controls the cluster.
- Data node
-
A node that has node.data set to true (default). Data nodes hold data and perform data-related operations such as CRUD, search, and aggregations.
- Client node
-
A client node has both node.master and node.data set to false. It can neither hold data nor become the master node. It behaves as a “smart router” and is used to forward cluster-level requests to the master node and data-related requests (such as search) to the appropriate data nodes.
- Tribe node
-
A tribe node, configured via the tribe.* settings, is a special type of client node that can connect to multiple clusters and perform search and other operations across all connected clusters.
By default a node is both a master-eligible node and a data node. This is very convenient for small clusters but, as the cluster grows, it becomes important to consider separating dedicated master-eligible nodes from dedicated data nodes.
Note: Coordinating node
Requests like search requests or bulk-indexing requests may involve data held on different data nodes. A search request, for example, is executed in two phases which are coordinated by the node which receives the client request: the coordinating node. In the scatter phase, the coordinating node forwards the request to the data nodes which hold the data. Each data node executes the request locally and returns its results to the coordinating node. In the gather phase, the coordinating node reduces each data node’s results into a single global result set. This means that a client node needs to have enough memory and CPU in order to deal with the gather phase.
Master Eligible Node
The master node is responsible for lightweight cluster-wide actions such as creating or deleting an index, tracking which nodes are part of the cluster, and deciding which shards to allocate to which nodes. It is important for cluster health to have a stable master node.
Any master-eligible node (all nodes by default) may be elected to become the master node by the master election process.
Indexing and searching your data is CPU-, memory-, and I/O-intensive work which can put pressure on a node’s resources. To ensure that your master node is stable and not under pressure, it is a good idea in a bigger cluster to split the roles between dedicated master-eligible nodes and dedicated data nodes.
While master nodes can also behave as coordinating nodes and route search and indexing requests from clients to data nodes, it is better not to use dedicated master nodes for this purpose. It is important for the stability of the cluster that master-eligible nodes do as little work as possible.
To create a standalone master-eligible node, set:
node.master: true
node.data: false
The node.master role is enabled by default; here the node.data role (also enabled by default) is explicitly disabled.
Avoiding split brain with minimum_master_nodes
To prevent data loss, it is vital to configure the
discovery.zen.minimum_master_nodes setting (which defaults to 1) so that
each master-eligible node knows the minimum number of master-eligible nodes
that must be visible in order to form a cluster.
To explain, imagine that you have a cluster consisting of two master-eligible
nodes. A network failure breaks communication between these two nodes. Each
node sees one master-eligible node… itself. With minimum_master_nodes set
to the default of 1, this is sufficient to form a cluster. Each node elects
itself as the new master (thinking that the other master-eligible node has
died) and the result is two clusters, or a split brain. These two nodes
will never rejoin until one node is restarted. Any data that has been written
to the restarted node will be lost.
Now imagine that you have a cluster with three master-eligible nodes, and
minimum_master_nodes set to 2. If a network split separates one node from
the other two nodes, the side with one node cannot see enough master-eligible
nodes and will realise that it cannot elect itself as master. The side with
two nodes will elect a new master (if needed) and continue functioning
correctly. As soon as the network split is resolved, the single node will
rejoin the cluster and start serving requests again.
This setting should be set to a quorum of master-eligible nodes:
(master_eligible_nodes / 2) + 1
In other words, if there are three master-eligible nodes, then minimum master
nodes should be set to (3 / 2) + 1 or 2:
discovery.zen.minimum_master_nodes: 2
This setting defaults to 1.
This setting can also be changed dynamically on a live cluster with the cluster update settings API:
PUT _cluster/settings
{
"transient": {
"discovery.zen.minimum_master_nodes": 2
}
}
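The quorum formula uses integer division, which a quick check makes concrete (the function name is illustrative):

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Quorum of master-eligible nodes: (n / 2) + 1 with integer division."""
    return master_eligible_nodes // 2 + 1

print(minimum_master_nodes(3))  # 2
print(minimum_master_nodes(5))  # 3
# With only two master-eligible nodes the quorum is two, so losing either
# node makes the cluster inoperable -- which is why two is discouraged:
print(minimum_master_nodes(2))  # 2
```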
Tip: An advantage of splitting the master and data roles between dedicated nodes is that you can have just three master-eligible nodes and set minimum_master_nodes to 2. You never have to change this setting, no matter how many dedicated data nodes you add to the cluster.
Data Node
Data nodes hold the shards that contain the documents you have indexed. Data nodes handle data related operations like CRUD, search, and aggregations. These operations are I/O-, memory-, and CPU-intensive. It is important to monitor these resources and to add more data nodes if they are overloaded.
The main benefit of having dedicated data nodes is the separation of the master and data roles.
To create a dedicated data node, set:
node.master: false
node.data: true
Here the node.master role (enabled by default) is explicitly disabled; the node.data role is enabled by default.
Client Node
If you take away the ability to handle master duties and the ability to hold data, then you are left with a client node that can only route requests, handle the search reduce phase, and distribute bulk indexing. Essentially, client nodes behave as smart load balancers.
Standalone client nodes can benefit large clusters by offloading the coordinating node role from data and master-eligible nodes. Client nodes join the cluster and receive the full cluster state, like every other node, and they use the cluster state to route requests directly to the appropriate place(s).
Warning: Adding too many client nodes to a cluster can increase the burden on the entire cluster because the elected master node must await acknowledgement of cluster state updates from every node! The benefit of client nodes should not be overstated; data nodes can happily serve the same purpose as client nodes.
To create a dedicated client node, set:
node.master: false
node.data: false
Both the node.master and node.data roles (enabled by default) are explicitly disabled here.
Node data path settings
path.data
Every data and master-eligible node requires access to a data directory where
shards and index and cluster metadata will be stored. The path.data setting defaults
to $ES_HOME/data but can be configured in the elasticsearch.yml config
file as an absolute path or a path relative to $ES_HOME as follows:
path.data: /var/elasticsearch/data
Like all node settings, it can also be specified on the command line as:
./bin/elasticsearch --path.data /var/elasticsearch/data
Warning: When using the .zip or .tar.gz distributions, the path.data setting should be configured to locate the data directory outside the Elasticsearch home directory, so that the home directory can be deleted without deleting your data! The RPM and Debian distributions do this for you already.
node.max_local_storage_nodes
The data path can be shared by multiple nodes, even by nodes from different clusters. This is very useful for testing failover and different configurations on your development machine. In production, however, it is recommended to run only one node of Elasticsearch per server.
To prevent more than one node from sharing the same data path, add this
setting to the elasticsearch.yml config file:
node.max_local_storage_nodes: 1
Warning: Never run different node types (i.e. master, data, client) from the same data directory. This can lead to unexpected data loss.
Other node settings
More node settings can be found in Modules. Of particular note are
the cluster.name, the node.name and the
network settings.
134. Plugins
Plugins are a way to enhance basic Elasticsearch functionality in a custom manner. They range from adding custom mapping types, custom analyzers, native scripts, custom discovery, and more.
See the Plugins documentation for more.
135. Scripting
The scripting module allows you to use scripts in order to evaluate custom expressions. For example, scripts can be used to return "script fields" as part of a search request, or to evaluate a custom score for a query, and so on.
The scripting module uses Groovy (previously MVEL in 1.3.x and earlier) as the default scripting language, with some extensions. Groovy is used since it is extremely fast and very simple to use.
Warning: Groovy dynamic scripting off by default from v1.4.3
Groovy dynamic scripting is off by default, preventing dynamic Groovy scripts
from being accepted as part of a request or retrieved from the special
.scripts index. You can still use file scripts: to convert an inline script to a file,
save the contents of the script in a file under the config/scripts directory.
You can then access the script by file name (without the extension).
Additional lang plugins are provided to allow scripts to be executed in
different languages. Anywhere a script can be used, a lang parameter
can be provided to define the language of the script. The following are the
supported scripting languages:
| Language | Sandboxed | Required plugin |
|---|---|---|
groovy |
no |
built-in |
expression |
yes |
built-in |
mustache |
yes |
built-in |
javascript |
no |
lang-javascript |
python |
no |
lang-python |
To increase security, Elasticsearch does not allow you to specify scripts for
non-sandboxed languages with a request. Instead, scripts must be placed in the
scripts directory inside the configuration directory (the directory where
elasticsearch.yml is). The default location of this scripts directory can be
changed by setting path.scripts in elasticsearch.yml. Scripts placed into
this directory will automatically be picked up and be available to be used.
Once a script has been placed in this directory, it can be referenced by name.
For example, a script called calculate-score.groovy can be referenced in a
request like this:
$ tree config
config
├── elasticsearch.yml
├── logging.yml
└── scripts
└── calculate-score.groovy
$ cat config/scripts/calculate-score.groovy
log(_score * 2) + my_modifier
curl -XPOST localhost:9200/_search -d '{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "body": "foo"
        }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "lang": "groovy",
              "file": "calculate-score",
              "params": {
                "my_modifier": 8
              }
            }
          }
        }
      ]
    }
  }
}'
The name of the script is derived from the hierarchy of directories it
exists under, and the file name without the lang extension. For example,
a script placed under config/scripts/group1/group2/test.py will be
named group1_group2_test.
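As an illustration of this naming rule, here is a small sketch (a hypothetical helper, not part of Elasticsearch):

```python
import os.path

def script_name(relative_path):
    """Derive a script's reference name from its path under config/scripts:
    directory separators become underscores and the language extension is dropped."""
    base, _ext = os.path.splitext(relative_path)
    return base.replace("/", "_")

print(script_name("group1/group2/test.py"))   # group1_group2_test
print(script_name("calculate-score.groovy"))  # calculate-score
```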
Indexed Scripts
Elasticsearch allows you to store scripts in an internal index known as
.scripts and reference them by id. There are REST endpoints to manage
indexed scripts as follows:
Requests to the scripts endpoint look like:
/_scripts/{lang}/{id}
Where the lang part is the language the script is in and the id part is the id
of the script. In the .scripts index the type of the document will be set to the lang.
curl -XPOST localhost:9200/_scripts/groovy/indexedCalculateScore -d '{
  "script": "log(_score * 2) + my_modifier"
}'
This will create a document with id: indexedCalculateScore and type: groovy in the
.scripts index. The type of the document is the language used by the script.
This script can be accessed at query time by using the id script parameter and passing
the script id:
curl -XPOST localhost:9200/_search -d '{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "body": "foo"
        }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "id": "indexedCalculateScore",
              "lang": "groovy",
              "params": {
                "my_modifier": 8
              }
            }
          }
        }
      ]
    }
  }
}'
The script can be viewed by:
curl -XGET localhost:9200/_scripts/groovy/indexedCalculateScore
This is rendered as:
'{
  "script": "log(_score * 2) + my_modifier"
}'
Indexed scripts can be deleted by:
curl -XDELETE localhost:9200/_scripts/groovy/indexedCalculateScore
Enabling dynamic scripting
We recommend running Elasticsearch behind an application or proxy, which protects Elasticsearch from the outside world. If users are allowed to run inline scripts (even in a search request) or indexed scripts, then they have the same access to your box as the user that Elasticsearch is running as. For this reason dynamic scripting is allowed only for sandboxed languages by default.
First, you should not run Elasticsearch as the root user, as this would allow
a script to access or do anything on your server, without limitations. Second,
you should not expose Elasticsearch directly to users, but instead have a proxy
application in between. If you do intend to expose Elasticsearch directly to
your users, then you have to decide whether you trust them enough to run scripts
on your box or not.
It is possible to enable scripts based on their source, for
every script engine, through the following settings that need to be added to the
config/elasticsearch.yml file on every node.
script.inline: true
script.indexed: true
While this still allows execution of named scripts provided in the config, or
native Java scripts registered through plugins, it also allows users to run
arbitrary scripts via the API. Instead of sending the name of the file as the
script, the body of the script can be sent instead, or it can be retrieved from the
.scripts index if previously stored.
There are three possible configuration values for any of the fine-grained script settings:

| Value | Description |
|---|---|
| false | Scripting is turned off completely, in the context of the setting being set. |
| true | Scripting is turned on, in the context of the setting being set. |
| sandbox | Scripts may be executed only for languages that are sandboxed. |
The default values are the following:
script.inline: sandbox
script.indexed: sandbox
script.file: true
Note: Global scripting settings affect the mustache scripting language.
Search templates internally use the mustache language, and will still be
enabled by default as the mustache engine is sandboxed, but they will be
enabled/disabled according to fine-grained settings specified in
elasticsearch.yml.
It is also possible to control which operations can execute scripts. The supported operations are:

| Value | Description |
|---|---|
| aggs | Aggregations (wherever they may be used) |
| search | Search api, Percolator api and Suggester api (e.g. filters, script_fields) |
| update | Update api |
| plugin | Any plugin that makes use of scripts under the generic plugin category |
Plugins can also define custom operations that they use scripts for instead
of using the generic plugin category. Those operations can be referred to
in the following form: ${pluginName}_${operation}.
The following example disables scripting for update and mapping operations,
regardless of the script source, for any engine. Scripts can still be
executed from sandboxed languages as part of aggregations, search,
and plugin execution, though, as the above defaults still apply.
script.update: false
script.mapping: false
Generic settings are applied in order: operation-based ones take precedence
over source-based ones. Language-specific settings are supported too. They
need to be prefixed with script.engine.<engine> and have
precedence over any other generic settings.
script.engine.groovy.file.aggs: true
script.engine.groovy.file.mapping: true
script.engine.groovy.file.search: true
script.engine.groovy.file.update: true
script.engine.groovy.file.plugin: true
script.engine.groovy.indexed.aggs: true
script.engine.groovy.indexed.mapping: false
script.engine.groovy.indexed.search: true
script.engine.groovy.indexed.update: false
script.engine.groovy.indexed.plugin: false
script.engine.groovy.inline.aggs: true
script.engine.groovy.inline.mapping: false
script.engine.groovy.inline.search: false
script.engine.groovy.inline.update: false
script.engine.groovy.inline.plugin: false
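The precedence rules above can be pictured with a small resolver. This is an illustrative sketch of the documented precedence order (engine-specific, then operation-based, then source-based), not Elasticsearch's actual implementation:

```python
def script_allowed(settings, lang, source, operation,
                   sandboxed_langs=("expression", "mustache")):
    """Resolve whether a script may run, checking settings from most to least
    specific. True/False decides outright; "sandbox" allows only sandboxed languages."""
    for key in (
        "script.engine.%s.%s.%s" % (lang, source, operation),  # engine-specific
        "script.%s" % operation,                               # operation-based
        "script.%s" % source,                                  # source-based
    ):
        if key in settings:
            value = settings[key]
            if value == "sandbox":
                return lang in sandboxed_langs
            return bool(value)
    return False  # nothing configured: deny in this sketch

settings = {
    "script.inline": "sandbox",
    "script.update": False,
    "script.engine.groovy.inline.aggs": True,
}
print(script_allowed(settings, "groovy", "inline", "aggs"))        # True: engine-specific wins
print(script_allowed(settings, "groovy", "inline", "update"))      # False: operation-based setting
print(script_allowed(settings, "expression", "inline", "search"))  # True: sandboxed language
```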
Default Scripting Language
The default scripting language (assuming no lang parameter is provided) is
groovy. In order to change it, set the script.default_lang to the
appropriate language.
Automatic Script Reloading
The config/scripts directory is scanned periodically for changes.
New and changed scripts are reloaded, and deleted scripts are removed
from the preloaded scripts cache. The reload frequency can be specified
using the resource.reload.interval setting, which defaults to 60s.
To disable script reloading completely, set script.auto_reload_enabled
to false.
Native (Java) Scripts
Sometimes groovy and expressions aren’t enough. For those times you can
implement a native script.
The best way to implement a native script is to write a plugin and install it. The plugin documentation has more information on how to write a plugin so that Elasticsearch will properly load it.
To register the actual script you’ll need to implement NativeScriptFactory
to construct the script. The actual script will extend either
AbstractExecutableScript or AbstractSearchScript. The second one is likely
the most useful and has several helpful subclasses you can extend like
AbstractLongSearchScript, AbstractDoubleSearchScript, and
AbstractFloatSearchScript. Finally, your plugin should register the native
script by declaring the onModule(ScriptModule) method.
If you squashed the whole thing into one class it’d look like:
import java.util.Map;

import org.elasticsearch.common.Nullable;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.script.AbstractFloatSearchScript;
import org.elasticsearch.script.ExecutableScript;
import org.elasticsearch.script.NativeScriptFactory;
import org.elasticsearch.script.ScriptModule;

public class MyNativeScriptPlugin extends Plugin {

    @Override
    public String name() {
        return "my-native-script";
    }

    @Override
    public String description() {
        return "my native script that does something great";
    }

    public void onModule(ScriptModule scriptModule) {
        scriptModule.registerScript("my_script", MyNativeScriptFactory.class);
    }

    public static class MyNativeScriptFactory implements NativeScriptFactory {
        @Override
        public ExecutableScript newScript(@Nullable Map<String, Object> params) {
            return new MyNativeScript();
        }

        @Override
        public boolean needsScores() {
            return false;
        }
    }

    public static class MyNativeScript extends AbstractFloatSearchScript {
        @Override
        public float runAsFloat() {
            // JSON numbers arrive as boxed Number instances, so convert via
            // Number rather than casting the Object directly to float.
            float a = ((Number) source().get("a")).floatValue();
            float b = ((Number) source().get("b")).floatValue();
            return a * b;
        }
    }
}
You can execute the script by specifying its lang as native and the name
of the script as the inline parameter:
curl -XPOST localhost:9200/_search -d '{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "body": "foo"
        }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "inline": "my_script",
              "lang": "native"
            }
          }
        }
      ]
    }
  }
}'
Lucene Expressions Scripts
Note: The Lucene expressions module is experimental. It is undergoing significant development and the exposed functionality is likely to change in the future.
Lucene’s expressions module provides a mechanism to compile a
javascript expression to bytecode. This allows very fast execution,
as if you had written a native script. Expression scripts can be
used in script_score, script_fields, sort scripts and numeric aggregation scripts.
See the expressions module documentation for details on what operators and functions are available.
Variables in expression scripts are available to access:

- Single valued document fields, e.g. doc['myfield'].value
- Single valued document fields can also be accessed without .value, e.g. doc['myfield']
- Parameters passed into the script, e.g. mymodifier
- The current document's score, _score (only available when used in a script_score)
Variables in expression scripts that are of type date may use the following member methods:

- getYear()
- getMonth()
- getDayOfMonth()
- getHourOfDay()
- getMinutes()
- getSeconds()
The following example shows the difference in years between the date fields date0 and date1:
doc['date1'].getYear() - doc['date0'].getYear()
There are a few limitations relative to other script languages:

- Only numeric fields may be accessed
- Stored fields are not available
- If a field is sparse (only some documents contain a value), documents missing the field will have a value of 0
Score
In all scripts that can be used in aggregations, the current
document’s score is accessible in _score.
See Text scoring in scripts below for computing scores based on terms.
Document Fields
Most scripting revolves around the use of specific document field data.
doc['field_name'] can be used to access specific field data within
a document (the document in question is usually derived from the context in
which the script is used). Document fields are very fast to access since they
end up being loaded into memory (all the relevant field values/tokens
are loaded into memory). Note, however, that the doc[...] notation only
allows for simple valued fields (it can't return a JSON object)
and makes sense only on non-analyzed or single-term-based fields.
The following data can be extracted from a field:
| Expression | Description |
|---|---|
| doc['field_name'].value | The native value of the field. For example, if it's a short type, it will be short. |
| doc['field_name'].values | The native array values of the field. For example, if it's a short type, it will be short[]. Remember, a field can have several values within a single doc. Returns an empty array if the field has no values. |
| doc['field_name'].empty | A boolean indicating if the field has no values within the doc. |
| doc['field_name'].multiValued | A boolean indicating that the field has several values within the corpus. |
| doc['field_name'].lat | The latitude of a geo point type. |
| doc['field_name'].lon | The longitude of a geo point type. |
| doc['field_name'].lats | The latitudes of a geo point type. |
| doc['field_name'].lons | The longitudes of a geo point type. |
| doc['field_name'].distance(lat, lon) | The plane distance (in meters) of this geo point field from the provided lat/lon. |
| doc['field_name'].distanceWithDefault(lat, lon, default) | The plane distance (in meters) of this geo point field from the provided lat/lon with a default value. |
| doc['field_name'].distanceInMiles(lat, lon) | The plane distance (in miles) of this geo point field from the provided lat/lon. |
| doc['field_name'].distanceInMilesWithDefault(lat, lon, default) | The plane distance (in miles) of this geo point field from the provided lat/lon with a default value. |
| doc['field_name'].distanceInKm(lat, lon) | The plane distance (in km) of this geo point field from the provided lat/lon. |
| doc['field_name'].distanceInKmWithDefault(lat, lon, default) | The plane distance (in km) of this geo point field from the provided lat/lon with a default value. |
| doc['field_name'].arcDistance(lat, lon) | The arc distance (in meters) of this geo point field from the provided lat/lon. |
| doc['field_name'].arcDistanceWithDefault(lat, lon, default) | The arc distance (in meters) of this geo point field from the provided lat/lon with a default value. |
| doc['field_name'].arcDistanceInMiles(lat, lon) | The arc distance (in miles) of this geo point field from the provided lat/lon. |
| doc['field_name'].arcDistanceInMilesWithDefault(lat, lon, default) | The arc distance (in miles) of this geo point field from the provided lat/lon with a default value. |
| doc['field_name'].arcDistanceInKm(lat, lon) | The arc distance (in km) of this geo point field from the provided lat/lon. |
| doc['field_name'].arcDistanceInKmWithDefault(lat, lon, default) | The arc distance (in km) of this geo point field from the provided lat/lon with a default value. |
| doc['field_name'].factorDistance(lat, lon) | The distance factor of this geo point field from the provided lat/lon. |
| doc['field_name'].factorDistanceWithDefault(lat, lon, default) | The distance factor of this geo point field from the provided lat/lon with a default value. |
| doc['field_name'].geohashDistance(geohash) | The arc distance (in meters) of this geo point field from the provided geohash. |
| doc['field_name'].geohashDistanceInKm(geohash) | The arc distance (in km) of this geo point field from the provided geohash. |
| doc['field_name'].geohashDistanceInMiles(geohash) | The arc distance (in miles) of this geo point field from the provided geohash. |
Stored Fields
Stored fields can also be accessed when executing a script. Note, they
are much slower to access compared with document fields, as they are not
loaded into memory. They can be simply accessed using
_fields['my_field_name'].value or _fields['my_field_name'].values.
Accessing the score of a document within a script
When using scripting for calculating the score of a document (for instance, with
the function_score query), you can access the score using the _score
variable inside of a Groovy script.
Source Field
The source field can also be accessed when executing a script. The
source field is loaded per doc, parsed, and then provided to the script
for evaluation. The _source forms the context under which the source
field can be accessed, for example _source.obj2.obj1.field3.
Accessing _source is much slower compared to using doc
but the data is not loaded into memory. For a single field access _fields may be
faster than using _source due to the extra overhead of potentially parsing large documents.
However, _source may be faster if you access multiple fields or if the source has already been
loaded for other purposes.
Groovy Built In Functions
There are several built in functions that can be used within scripts. They include:
| Function | Description |
|---|---|
| sin(a) | Returns the trigonometric sine of an angle. |
| cos(a) | Returns the trigonometric cosine of an angle. |
| tan(a) | Returns the trigonometric tangent of an angle. |
| asin(a) | Returns the arc sine of a value. |
| acos(a) | Returns the arc cosine of a value. |
| atan(a) | Returns the arc tangent of a value. |
| toRadians(angdeg) | Converts an angle measured in degrees to an approximately equivalent angle measured in radians. |
| toDegrees(angrad) | Converts an angle measured in radians to an approximately equivalent angle measured in degrees. |
| exp(a) | Returns Euler's number e raised to the power of a value. |
| log(a) | Returns the natural logarithm (base e) of a value. |
| log10(a) | Returns the base 10 logarithm of a value. |
| sqrt(a) | Returns the correctly rounded positive square root of a value. |
| cbrt(a) | Returns the cube root of a double value. |
| IEEEremainder(f1, f2) | Computes the remainder operation on two arguments as prescribed by the IEEE 754 standard. |
| ceil(a) | Returns the smallest (closest to negative infinity) value that is greater than or equal to the argument and is equal to a mathematical integer. |
| floor(a) | Returns the largest (closest to positive infinity) value that is less than or equal to the argument and is equal to a mathematical integer. |
| rint(a) | Returns the value that is closest in value to the argument and is equal to a mathematical integer. |
| atan2(y, x) | Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta). |
| pow(a, b) | Returns the value of the first argument raised to the power of the second argument. |
| round(a) | Returns the closest int to the argument. |
| random() | Returns a random double value. |
| abs(a) | Returns the absolute value of a value. |
| max(a, b) | Returns the greater of two values. |
| min(a, b) | Returns the smaller of two values. |
| ulp(d) | Returns the size of an ulp of the argument. |
| signum(d) | Returns the signum function of the argument. |
| sinh(x) | Returns the hyperbolic sine of a value. |
| cosh(x) | Returns the hyperbolic cosine of a value. |
| tanh(x) | Returns the hyperbolic tangent of a value. |
| hypot(x, y) | Returns sqrt(x2 + y2) without intermediate overflow or underflow. |
135.1. Text scoring in scripts
Text features, such as term or document frequency for a specific term, can be accessed in scripts (see the scripting documentation) with the _index variable. This can be useful if, for example, you want to implement your own scoring model using a script inside a function score query.
Statistics over the document collection are computed per shard, not per
index.
Nomenclature:

- df: document frequency. The number of documents a term appears in. Computed per field.
- tf: term frequency. The number of times a term appears in a field in one specific document.
- ttf: total term frequency. The number of times this term appears in all documents, that is, the sum of tf over all documents.

df and ttf are computed per shard and therefore these numbers can vary
depending on the shard the current document resides in.
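These statistics can be illustrated with a toy computation over an in-memory "shard" (a conceptual sketch, not the Elasticsearch API):

```python
from collections import Counter

# One tokenized field per document in a toy "shard".
docs = [
    ["foo", "bar", "foo"],
    ["foo", "baz"],
    ["bar", "bar"],
]

def df(term):
    """Document frequency: number of documents the term appears in."""
    return sum(1 for d in docs if term in d)

def tf(term, doc_id):
    """Term frequency: occurrences of the term in one document's field."""
    return Counter(docs[doc_id])[term]

def ttf(term):
    """Total term frequency: sum of tf over all documents."""
    return sum(Counter(d)[term] for d in docs)

print(df("foo"), tf("foo", 0), ttf("foo"))  # 2 2 3
print(df("bar"), ttf("bar"))                # 2 3
```

In Elasticsearch these numbers are computed per shard, so a real cluster would run this kind of computation once per shard rather than over the whole index.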
Shard statistics:

- _index.numDocs(): Number of documents in shard.
- _index.maxDoc(): Maximal document number in shard.
- _index.numDeletedDocs(): Number of deleted documents in shard.

Field statistics:

Field statistics can be accessed with a subscript operator like this:
_index['FIELD'].

- _index['FIELD'].docCount(): Number of documents containing the field FIELD. Does not take deleted documents into account.
- _index['FIELD'].sumttf(): Sum of ttf over all terms that appear in field FIELD in all documents.
- _index['FIELD'].sumdf(): The sum of df over all terms that appear in field FIELD in all documents.

Field statistics are computed per shard and therefore these numbers can vary
depending on the shard the current document resides in.
The number of terms in a field cannot be accessed using the _index variable. See Token count datatype for how to do that.
Term statistics:

Term statistics for a field can be accessed with a subscript operator like
this: _index['FIELD']['TERM']. This will never return null, even if the term or field does not exist.
If you do not need the term frequency, call _index['FIELD'].get('TERM', 0)
to avoid unnecessary initialization of the frequencies. The flag will only
take effect if you set the index_options to docs.

- _index['FIELD']['TERM'].df(): df of term TERM in field FIELD. Will be returned even if the term is not present in the current document.
- _index['FIELD']['TERM'].ttf(): The sum of term frequencies of term TERM in field FIELD over all documents. Will be returned even if the term is not present in the current document.
- _index['FIELD']['TERM'].tf(): tf of term TERM in field FIELD. Will be 0 if the term is not present in the current document.
Term positions, offsets and payloads:

If you need information on the positions of terms in a field, call
_index['FIELD'].get('TERM', flag) where flag can be:

- _POSITIONS: if you need the positions of the term
- _OFFSETS: if you need the offsets of the term
- _PAYLOADS: if you need the payloads of the term
- _CACHE: if you need to iterate over all positions several times
The iterator uses the underlying lucene classes to iterate over positions. For efficiency reasons, you can only iterate over positions once. If you need to iterate over the positions several times, set the _CACHE flag.
You can combine the flags with a | if you need more than one piece of information. For
example, the following will return an object holding the positions and payloads,
as well as all statistics:
`_index['FIELD'].get('TERM', _POSITIONS | _PAYLOADS)`
Positions can be accessed with an iterator that returns an object
(POS_OBJECT) holding position, offsets and payload for each term position.
- POS_OBJECT.position: The position of the term.
- POS_OBJECT.startOffset: The start offset of the term.
- POS_OBJECT.endOffset: The end offset of the term.
- POS_OBJECT.payload: The payload of the term.
- POS_OBJECT.payloadAsInt(missingValue): The payload of the term converted to integer. If the current position has no payload, the missingValue will be returned. Call this only if you know that your payloads are integers.
- POS_OBJECT.payloadAsFloat(missingValue): The payload of the term converted to float. If the current position has no payload, the missingValue will be returned. Call this only if you know that your payloads are floats.
- POS_OBJECT.payloadAsString(): The payload of the term converted to string. If the current position has no payload, null will be returned. Call this only if you know that your payloads are strings.
Example: sums up all payloads for the term foo.
termInfo = _index['my_field'].get('foo', _PAYLOADS);
score = 0;
for (pos in termInfo) {
    score = score + pos.payloadAsInt(0);
}
return score;
Term vectors:
The _index variable can only be used to gather statistics for single terms. If you want to use information on all terms in a field, you must store the term vectors (see term_vector). To access them, call
_index.termVectors() to get a Fields instance. This object can then be used as described in the Lucene documentation to iterate over fields and then, for each field, over each term in the field.
The method will return null if the term vectors were not stored.
135.2. Scripting and the Java Security Manager
Elasticsearch runs with the Java Security Manager enabled by default. The security policy in Elasticsearch locks down the permissions granted to each class to the bare minimum required to operate. The benefit of doing this is that it severely limits the attack vectors available to a hacker.
Restricting permissions is particularly important with scripting languages like Groovy and Javascript which are designed to do anything that can be done in Java itself, including writing to the file system, opening sockets to remote servers, etc.
Script Classloader Whitelist
Scripting languages are only allowed to load classes which appear in a
hardcoded whitelist that can be found in
org.elasticsearch.script.ClassPermission.
In a script, attempting to load a class that does not appear in the whitelist
may result in a ClassNotFoundException, for instance this script:
GET _search
{
  "script_fields": {
    "the_hour": {
      "script": "use(java.math.BigInteger); new BigInteger(1)"
    }
  }
}
will return the following exception:
{
  "reason": {
    "type": "script_exception",
    "reason": "failed to run inline script [use(java.math.BigInteger); new BigInteger(1)] using lang [groovy]",
    "caused_by": {
      "type": "no_class_def_found_error",
      "reason": "java/math/BigInteger",
      "caused_by": {
        "type": "class_not_found_exception",
        "reason": "java.math.BigInteger"
      }
    }
  }
}
However, classloader issues may also result in more difficult-to-interpret exceptions. For instance, this script:
use(groovy.time.TimeCategory); new Date(123456789).format('HH')
returns the following exception:
{
  "reason": {
    "type": "script_exception",
    "reason": "failed to run inline script [use(groovy.time.TimeCategory); new Date(123456789).format('HH')] using lang [groovy]",
    "caused_by": {
      "type": "missing_property_exception",
      "reason": "No such property: groovy for class: 8d45f5c1a07a1ab5dda953234863e283a7586240"
    }
  }
}
Dealing with Java Security Manager issues
If you encounter issues with the Java Security Manager, you have three options for resolving these issues:
Fix the security problem
The safest and most secure long term solution is to change the code causing the security issue. We recognise that this may take time to do correctly and so we provide the following two alternatives.
Disable the Java Security Manager
Deprecated in 2.2.0: The ability to disable the Java Security Manager will be removed in a future version.
You can disable the Java Security Manager entirely with the
security.manager.enabled command line flag:
./bin/elasticsearch --security.manager.enabled false
Warning: This disables the Security Manager entirely and makes Elasticsearch much more vulnerable to attacks! It is an option that should only be used in the most urgent of situations and for the shortest amount of time possible. Optional security is not secure at all because it will be disabled and leave the system vulnerable. This option will be removed in a future version.
Customising the classloader whitelist
The classloader whitelist can be customised by tweaking the local Java Security Policy, either:

- system wide: $JAVA_HOME/lib/security/java.policy,
- for just the elasticsearch user: /home/elasticsearch/.java.policy, or
- from a file specified in the JAVA_OPTS environment variable with -Djava.security.policy=someURL:

export JAVA_OPTS="${JAVA_OPTS} -Djava.security.policy=file:///path/to/my.policy"
./bin/elasticsearch
Permissions may be granted at the class, package, or global level. For instance:
grant {
    permission org.elasticsearch.script.ClassPermission "java.util.Base64"; // allow class
    permission org.elasticsearch.script.ClassPermission "java.util.*";      // allow package
    permission org.elasticsearch.script.ClassPermission "*";                // allow all (effectively disables filtering)
};
Here is an example of how to enable the groovy.time.TimeCategory class:
grant {
    permission org.elasticsearch.script.ClassPermission "java.lang.Class";
    permission org.elasticsearch.script.ClassPermission "groovy.time.TimeCategory";
};
Note: Before adding classes to the whitelist, consider the security impact that it will have on Elasticsearch. Do you really need an extra class, or can your code be rewritten in a more secure way? It is quite possible that we have not whitelisted a generically useful and safe class. If you have a class that you think should be whitelisted by default, please open an issue on GitHub and we will consider the impact of doing so.
See http://docs.oracle.com/javase/7/docs/technotes/guides/security/PolicyFiles.html for more information.
136. Snapshot And Restore
The snapshot and restore module allows you to create snapshots of individual indices or of an entire cluster into a remote repository. At the time of the initial release, only the shared file system repository was supported, but now a range of backends is available via officially supported repository plugins.
Repositories
Before any snapshot or restore operation can be performed, a snapshot repository should be registered in Elasticsearch. The repository settings are repository-type specific. See below for details.
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    ... repository specific settings ...
  }
}
Once a repository is registered, its information can be obtained using the following command:
GET /_snapshot/my_backup
which returns:
{
  "my_backup": {
    "type": "fs",
    "settings": {
      "compress": "true",
      "location": "/mount/backups/my_backup"
    }
  }
}
Information about multiple repositories can be fetched in one go by using a comma-delimited list of repository names.
Star wildcards are supported as well. For example, information about repositories that start with repo or that contain backup
can be obtained using the following command:
GET /_snapshot/repo*,*backup*
If a repository name is not specified, or _all is used as the repository name, Elasticsearch will return information about
all repositories currently registered in the cluster:
GET /_snapshot
or
GET /_snapshot/_all
Shared File System Repository
The shared file system repository ("type": "fs") uses the shared file system to store snapshots. In order to register
the shared file system repository it is necessary to mount the same shared filesystem to the same location on all
master and data nodes. This location (or one of its parent directories) must be registered in the path.repo
setting on all master and data nodes.
Assuming that the shared filesystem is mounted to /mount/backups/my_backup, the following setting should be added to
elasticsearch.yml file:
path.repo: ["/mount/backups", "/mount/longterm_backups"]
The path.repo setting supports Microsoft Windows UNC paths as long as at least the server name and share are specified as
a prefix and backslashes are properly escaped:
path.repo: ["\\\\MY_SERVER\\Snapshots"]
After all nodes are restarted, the following command can be used to register the shared file system repository with
the name my_backup:
$ curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
  "type": "fs",
  "settings": {
    "location": "/mount/backups/my_backup",
    "compress": true
  }
}'
If the repository location is specified as a relative path this path will be resolved against the first path specified
in path.repo:
$ curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
  "type": "fs",
  "settings": {
    "location": "my_backup",
    "compress": true
  }
}'
The following settings are supported:

| Setting | Description |
|---|---|
| location | Location of the snapshots. Mandatory. |
| compress | Turns on compression of the snapshot files. Compression is applied only to metadata files (index mapping and settings). Data files are not compressed. Defaults to true. |
| chunk_size | Big files can be broken down into chunks during snapshotting if needed. The chunk size can be specified in bytes or by using size value notation, i.e. 1g, 10m, 5k. Defaults to null (unlimited chunk size). |
| max_restore_bytes_per_sec | Throttles per node restore rate. Defaults to 40mb per second. |
| max_snapshot_bytes_per_sec | Throttles per node snapshot rate. Defaults to 40mb per second. |
| readonly | Makes repository read-only. Defaults to false. |
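The size value notation used by chunk_size (and by the *_bytes_per_sec settings) can be parsed roughly as follows. This is an illustrative sketch, not Elasticsearch's actual parser, and it only handles the single-letter suffixes shown above:

```python
def parse_size(value):
    """Parse a size value such as "1g", "10m", "5k", or a plain byte count."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}
    value = value.strip().lower()
    if value[-1] in units:
        return int(float(value[:-1]) * units[value[-1]])
    return int(value)  # no suffix: already a byte count

print(parse_size("5k"))   # 5120
print(parse_size("10m"))  # 10485760
print(parse_size("1g"))   # 1073741824
print(parse_size("512"))  # 512
```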
Read-only URL Repository
The URL repository ("type": "url") can be used as an alternative read-only way to access data created by the shared file
system repository. The URL specified in the url parameter should point to the root of the shared filesystem repository.
The following settings are supported:

| Setting | Description |
|---|---|
| url | Location of the snapshots. Mandatory. |
The URL repository supports the following protocols: "http", "https", "ftp", "file" and "jar". URL repositories with http:,
https:, and ftp: URLs have to be whitelisted by specifying allowed URLs in the repositories.url.allowed_urls setting.
This setting supports wildcards in the place of host, path, query, and fragment. For example:
repositories.url.allowed_urls: ["http://www.example.org/root/*", "https://*.mydomain.com/*?*#*"]
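This kind of wildcard whitelisting can be sketched with Python's fnmatch (Elasticsearch's own matching may differ in detail; this only illustrates the idea):

```python
from fnmatch import fnmatch

allowed_urls = ["http://www.example.org/root/*", "https://*.mydomain.com/*"]

def url_allowed(url):
    """Return True if the URL matches any whitelisted pattern."""
    return any(fnmatch(url, pattern) for pattern in allowed_urls)

print(url_allowed("http://www.example.org/root/backup1"))  # True
print(url_allowed("https://repo.mydomain.com/snapshots"))  # True
print(url_allowed("ftp://www.example.org/root/backup1"))   # False
```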
URL repositories with file: URLs can only point to locations registered in the path.repo setting similar to
shared file system repository.
Repository plugins
Other repository backends are available in these official plugins:
- AWS Cloud Plugin for S3 repositories
- HDFS Plugin for Hadoop environments
- Azure Cloud Plugin for Azure storage repositories
Repository Verification
When a repository is registered, it’s immediately verified on all master and data nodes to make sure that it is functional
on all nodes currently present in the cluster. The verify parameter can be used to explicitly disable the repository
verification when registering or updating a repository:
PUT /_snapshot/s3_repository?verify=false
{
  "type": "s3",
  "settings": {
    "bucket": "my_s3_bucket",
    "region": "eu-west-1"
  }
}
The verification process can also be executed manually by running the following command:
POST /_snapshot/s3_repository/_verify
It returns a list of nodes where the repository was successfully verified, or an error message if the verification process failed.
Snapshot
A repository can contain multiple snapshots of the same cluster. Snapshots are identified by unique names within the
cluster. A snapshot with the name snapshot_1 in the repository my_backup can be created by executing the following
command:
PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true
The wait_for_completion parameter specifies whether or not the request should return immediately after snapshot
initialization (default) or wait for snapshot completion. During snapshot initialization, information about all
previous snapshots is loaded into memory, which means that in large repositories it may take several seconds (or
even minutes) for this command to return even if the wait_for_completion parameter is set to false.
By default a snapshot of all open and started indices in the cluster is created. This behavior can be changed by specifying the list of indices in the body of the snapshot request.
PUT /_snapshot/my_backup/snapshot_1
{
  "indices": "index_1,index_2",
  "ignore_unavailable": "true",
  "include_global_state": false
}
The list of indices that should be included in the snapshot can be specified using the indices parameter, which
supports multi index syntax. The snapshot request also supports the
ignore_unavailable option. Setting it to true will cause indices that do not exist to be ignored during snapshot
creation. By default, when the ignore_unavailable option is not set and an index is missing, the snapshot request will fail.
By setting include_global_state to false it's possible to prevent the cluster global state from being stored as part of
the snapshot. By default, the entire snapshot will fail if one or more indices participating in the snapshot don't have
all primary shards available. This behaviour can be changed by setting partial to true.
The index snapshot process is incremental. In the process of making the index snapshot, Elasticsearch analyses the list of the index files that are already stored in the repository and copies only files that were created or changed since the last snapshot. This allows multiple snapshots to be preserved in the repository in a compact form. The snapshotting process is executed in a non-blocking fashion: all indexing and searching operations can continue to run against the index that is being snapshotted. However, a snapshot represents a point-in-time view of the index at the moment when the snapshot was created, so no records that were added to the index after the snapshot process was started will be present in the snapshot. The snapshot process starts immediately for the primary shards that have been started and are not relocating at the moment. Before version 1.2.0, the snapshot operation failed if the cluster had any relocating or initializing primaries of indices participating in the snapshot. Starting with version 1.2.0, Elasticsearch waits for relocation or initialization of shards to complete before snapshotting them.
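The incremental behaviour can be pictured with a toy model that copies into the repository only files that are new or changed since the last snapshot (a conceptual sketch of the idea, not the real snapshot format):

```python
def incremental_snapshot(index_files, repository):
    """Copy into the repository only files that are new or changed since the
    last snapshot; files are compared here by (name, checksum)."""
    copied = []
    for name, checksum in index_files.items():
        if repository.get(name) != checksum:
            repository[name] = checksum
            copied.append(name)
    return copied

repo = {}
print(incremental_snapshot({"seg_1": "aaa", "seg_2": "bbb"}, repo))  # ['seg_1', 'seg_2']
# Second snapshot: seg_2 changed, seg_3 is new; only those are copied.
print(incremental_snapshot({"seg_1": "aaa", "seg_2": "ccc", "seg_3": "ddd"}, repo))  # ['seg_2', 'seg_3']
```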
Besides creating a copy of each index the snapshot process can also store global cluster metadata, which includes persistent cluster settings and templates. The transient settings and registered snapshot repositories are not stored as part of the snapshot.
Only one snapshot process can be executed in the cluster at any time. While a snapshot of a particular shard is being created, the shard cannot be moved to another node, which can interfere with the rebalancing process and allocation filtering. Elasticsearch will only be able to move the shard to another node (according to the current allocation filtering settings and rebalancing algorithm) once the snapshot is finished.
Once a snapshot is created information about this snapshot can be obtained using the following command:
GET /_snapshot/my_backup/snapshot_1
As with repositories, information about multiple snapshots can be queried in one go, and wildcards are supported:
GET /_snapshot/my_backup/snapshot_*,some_other_snapshot
All snapshots currently stored in the repository can be listed using the following command:
GET /_snapshot/my_backup/_all
The command fails if some of the snapshots are unavailable. The boolean parameter ignore_unavailable can be used to
return all snapshots that are currently available.
A currently running snapshot can be retrieved using the following command:
$ curl -XGET "localhost:9200/_snapshot/my_backup/_current"
A snapshot can be deleted from the repository using the following command:
DELETE /_snapshot/my_backup/snapshot_1
When a snapshot is deleted from a repository, Elasticsearch deletes all files that are associated with the deleted snapshot and not used by any other snapshots. If the delete snapshot operation is executed while the snapshot is being created, the snapshotting process will be aborted and all files created as part of it will be cleaned up. Therefore, the delete snapshot operation can be used to cancel long running snapshot operations that were started by mistake.
A repository can be deleted using the following command:
DELETE /_snapshot/my_backup
When a repository is deleted, Elasticsearch only removes the reference to the location where the repository is storing the snapshots. The snapshots themselves are left untouched and in place.
Restore
A snapshot can be restored using the following command:
POST /_snapshot/my_backup/snapshot_1/_restore
By default, all indices in the snapshot as well as the cluster state are restored. It’s possible to select which indices
should be restored, and to prevent the global cluster state from being restored, by using the indices and
include_global_state options in the restore request body. The list of indices supports
multi index syntax. The rename_pattern and rename_replacement options can also be used to
rename indices on restore using a regular expression that supports referencing the original text as explained
here.
Set include_aliases to false to prevent aliases from being restored together with associated indices.
POST /_snapshot/my_backup/snapshot_1/_restore
{
"indices": "index_1,index_2",
"ignore_unavailable": "true",
"include_global_state": false,
"rename_pattern": "index_(.+)",
"rename_replacement": "restored_index_$1"
}
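The effect of rename_pattern and rename_replacement in the request above can be illustrated with an equivalent regular-expression substitution (a sketch in Python; note that Python uses \1 where the Elasticsearch request body uses $1):

```python
import re

# Hypothetical index names as they appear in the snapshot.
snapshot_indices = ["index_1", "index_2"]

rename_pattern = "index_(.+)"
rename_replacement = r"restored_index_\1"  # "$1" in the Elasticsearch request

restored = [re.sub(rename_pattern, rename_replacement, name)
            for name in snapshot_indices]
print(restored)  # ['restored_index_1', 'restored_index_2']
```

Each capture group in rename_pattern can be referenced from rename_replacement, so a whole family of indices can be renamed with a single rule.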
The restore operation can be performed on a functioning cluster. However, an existing index can be only restored if it’s closed and has the same number of shards as the index in the snapshot. The restore operation automatically opens restored indices if they were closed and creates new indices if they didn’t exist in the cluster. If cluster state is restored, the restored templates that don’t currently exist in the cluster are added and existing templates with the same name are replaced by the restored templates. The restored persistent settings are added to the existing persistent settings.
Partial restore
By default, the entire restore operation will fail if one or more indices participating in the operation don’t have
snapshots of all shards available. This can occur, for example, if some shards failed to snapshot. It is still possible to
restore such indices by setting partial to true. Please note that only successfully snapshotted shards will be
restored in this case and all missing shards will be recreated empty.
Changing index settings during restore
Most index settings can be overridden during the restore process. For example, the following command will restore
the index index_1 without creating any replicas while switching back to the default refresh interval:
POST /_snapshot/my_backup/snapshot_1/_restore
{
"indices": "index_1",
"index_settings": {
"index.number_of_replicas": 0
},
"ignore_index_settings": [
"index.refresh_interval"
]
}
Please note that some settings such as index.number_of_shards cannot be changed during the restore operation.
Restoring to a different cluster
The information stored in a snapshot is not tied to a particular cluster or a cluster name. Therefore it’s possible to restore a snapshot made from one cluster into another cluster. All that is required is registering the repository containing the snapshot in the new cluster and starting the restore process. The new cluster doesn’t have to have the same size or topology. However, the version of the new cluster should be the same or newer than the cluster that was used to create the snapshot.
If the new cluster has a smaller size, additional considerations should be made. First of all, it’s necessary to make sure
that the new cluster has enough capacity to store all indices in the snapshot. It’s possible to change index settings
during restore to reduce the number of replicas, which can help with restoring snapshots into a smaller cluster. It’s also
possible to select only a subset of the indices using the indices parameter. Prior to version 1.5.0, Elasticsearch
didn’t check restored persistent settings, making it possible to accidentally restore an incompatible
discovery.zen.minimum_master_nodes setting, and as a result disable a smaller cluster until the required number of
master eligible nodes is added. Starting with version 1.5.0, incompatible settings are ignored.
If indices in the original cluster were assigned to particular nodes using shard allocation filtering, the same rules will be enforced in the new cluster. Therefore if the new cluster doesn’t contain nodes with appropriate attributes that a restored index can be allocated on, such index will not be successfully restored unless these index allocation settings are changed during restore operation.
Snapshot status
A list of currently running snapshots with their detailed status information can be obtained using the following command:
GET /_snapshot/_status
In this format, the command will return information about all currently running snapshots. By specifying a repository name, it’s possible to limit the results to a particular repository:
GET /_snapshot/my_backup/_status
If both repository name and snapshot id are specified, this command will return detailed status information for the given snapshot even if it’s not currently running:
GET /_snapshot/my_backup/snapshot_1/_status
Multiple ids are also supported:
GET /_snapshot/my_backup/snapshot_1,snapshot_2/_status
Monitoring snapshot/restore progress
There are several ways to monitor the progress of the snapshot and restore processes while they are running. Both
operations support the wait_for_completion parameter, which blocks the client until the operation is completed. This is
the simplest method to get notified about operation completion.
The snapshot operation can also be monitored by periodic calls to the snapshot info API:
GET /_snapshot/my_backup/snapshot_1
Please note that the snapshot info operation uses the same resources and thread pool as the snapshot operation. So, executing a snapshot info operation while large shards are being snapshotted can cause it to wait for available resources before returning the result. On very large shards the wait time can be significant.
To get more immediate and complete information about snapshots the snapshot status command can be used instead:
GET /_snapshot/my_backup/snapshot_1/_status
While the snapshot info method returns only basic information about the snapshot in progress, the snapshot status returns a complete breakdown of the current state for each shard participating in the snapshot.
The restore process piggybacks on the standard recovery mechanism of Elasticsearch. As a result, standard recovery
monitoring services can be used to monitor the state of restore. When the restore operation is executed the cluster
typically goes into red state. This happens because the restore operation starts with "recovering" primary shards of the
restored indices. During this operation the primary shards become unavailable, which manifests itself in the red cluster
state. Once recovery of primary shards is completed, Elasticsearch switches to the standard replication process that
creates the required number of replicas; at this moment the cluster switches to the yellow state. Once all required replicas
are created, the cluster switches to the green state.
The cluster health operation provides only a high level status of the restore process. It’s possible to get more detailed insight into the current state of the recovery process by using indices recovery and cat recovery APIs.
Stopping currently running snapshot and restore operations
The snapshot and restore framework allows running only one snapshot or one restore operation at a time. If a currently running snapshot was executed by mistake, or takes unusually long, it can be terminated using the snapshot delete operation. The snapshot delete operation checks if the deleted snapshot is currently running and, if it is, stops that snapshot before deleting the snapshot data from the repository.
The restore operation uses the standard shard recovery mechanism. Therefore, any currently running restore operation can be canceled by deleting indices that are being restored. Please note that data for all deleted indices will be removed from the cluster as a result of this operation.
Effect of cluster blocks on snapshot and restore operations
Many snapshot and restore operations are affected by cluster and index blocks. For example, registering and unregistering repositories require write access to the global metadata. The snapshot operation requires that all indices and their metadata, as well as the global metadata, be readable. The restore operation requires the global metadata to be writable; however, the index level blocks are ignored during restore because indices are essentially recreated during restore. Please note that repository content is not part of the cluster, and therefore cluster blocks don’t affect internal repository operations such as listing or deleting snapshots from an already registered repository.
137. Thread Pool
A node holds several thread pools in order to improve how thread memory consumption is managed within a node. Many of these pools also have queues associated with them, which allow pending requests to be held instead of discarded.
There are several thread pools, but the important ones include:

generic-
For generic operations (e.g., background node discovery). Thread pool type is cached.

index-
For index/delete operations. Thread pool type is fixed with a size of # of available processors, queue_size of 200.

search-
For count/search operations. Thread pool type is fixed with a size of int((# of available processors * 3) / 2) + 1, queue_size of 1000.

suggest-
For suggest operations. Thread pool type is fixed with a size of # of available processors, queue_size of 1000.

get-
For get operations. Thread pool type is fixed with a size of # of available processors, queue_size of 1000.

bulk-
For bulk operations. Thread pool type is fixed with a size of # of available processors, queue_size of 50.

percolate-
For percolate operations. Thread pool type is fixed with a size of # of available processors, queue_size of 1000.

snapshot-
For snapshot/restore operations. Thread pool type is scaling with a keep-alive of 5m and a size of min(5, (# of available processors)/2).

warmer-
For segment warm-up operations. Thread pool type is scaling with a keep-alive of 5m and a size of min(5, (# of available processors)/2).

refresh-
For refresh operations. Thread pool type is scaling with a keep-alive of 5m and a size of min(10, (# of available processors)/2).

listener-
Mainly for the Java client, executing actions when the listener is set to be threaded. Thread pool type is scaling with a default size of min(10, (# of available processors)/2).
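The size formulas listed above can be sketched for a given processor count (an illustration in Python, not Elasticsearch source code; the exact rounding used internally is an assumption here):

```python
# A sketch of the default size formulas listed above, for a given
# processor count (illustrative only, not Elasticsearch source code).
def default_pool_sizes(processors):
    half_max_five = min(5, processors // 2)
    return {
        "index": processors,                      # fixed, queue_size 200
        "search": int((processors * 3) / 2) + 1,  # fixed, queue_size 1000
        "bulk": processors,                       # fixed, queue_size 50
        "snapshot": half_max_five,                # scaling, keep-alive 5m
        "warmer": half_max_five,                  # scaling, keep-alive 5m
        "refresh": min(10, processors // 2),      # scaling, keep-alive 5m
    }

print(default_pool_sizes(8))
# {'index': 8, 'search': 13, 'bulk': 8, 'snapshot': 4, 'warmer': 4, 'refresh': 4}
```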
Changing a specific thread pool can be done by setting its type-specific parameters; for example, changing the index
thread pool to have more threads:
threadpool:
index:
size: 30
Tip: You can update thread pool settings dynamically using the Cluster Update Settings API.
Thread pool types
The following are the types of thread pools and their respective parameters:
cached
The cached thread pool is an unbounded thread pool that will spawn a
thread if there are pending requests. This thread pool is used to
prevent requests submitted to this pool from blocking or being
rejected. Unused threads in this thread pool will be terminated after
a keep alive expires (defaults to five minutes). The cached thread
pool is reserved for the generic thread pool.
The keep_alive parameter determines how long a thread should be kept
around in the thread pool without doing any work.
threadpool:
generic:
keep_alive: 2m
fixed
The fixed thread pool holds a fixed size of threads to handle the
requests with a queue (optionally bounded) for pending requests that
have no threads to service them.
The size parameter controls the number of threads, and defaults to the
number of cores times 5.
The queue_size parameter controls the size of the queue of pending
requests that have no threads to execute them. By default, it is set to
-1, which means it is unbounded. When a request comes in and the queue is
full, the request will be aborted.
threadpool:
index:
size: 30
queue_size: 1000
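The bounded-queue behaviour of a fixed pool can be sketched with a standard queue (an illustration in Python; the tiny queue_size is hypothetical):

```python
from queue import Queue, Full

# Sketch of the bounded-queue behaviour described above: once the
# queue of pending requests is full, further requests are rejected.
q = Queue(maxsize=2)  # a tiny queue_size, for illustration
outcomes = []
for request in ["a", "b", "c"]:
    try:
        q.put_nowait(request)
        outcomes.append((request, "queued"))
    except Full:
        outcomes.append((request, "rejected"))
print(outcomes)  # [('a', 'queued'), ('b', 'queued'), ('c', 'rejected')]
```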
scaling
The scaling thread pool holds a dynamic number of threads. This number is
proportional to the workload and varies between 1 and the value of the
size parameter.
The keep_alive parameter determines how long a thread should be kept
around in the thread pool without it doing any work.
threadpool:
warmer:
size: 8
keep_alive: 2m
Processors setting
The number of processors is automatically detected, and the thread pool
settings are automatically set based on it. Sometimes the number of processors
is wrongly detected; in such cases, the number of processors can be
explicitly set using the processors setting.
In order to check the number of processors detected, use the nodes info
API with the os flag.
138. Transport
The transport module is used for internal communication between nodes within the cluster. Each call that goes from one node to the other uses the transport module (for example, when an HTTP GET request is processed by one node, and should actually be processed by another node that holds the data).
The transport mechanism is completely asynchronous in nature, meaning that there is no blocking thread waiting for a response. The benefit of using asynchronous communication is first solving the C10k problem, as well as being the ideal solution for scatter (broadcast) / gather operations such as search in Elasticsearch.
TCP Transport
The TCP transport is an implementation of the transport module using TCP. It allows for the following settings:
| Setting | Description |
|---|---|
| transport.tcp.port | A bind port range. Defaults to 9300-9400. |
| transport.publish_port | The port that other nodes in the cluster should use when communicating with this node. Useful when a cluster node is behind a proxy or firewall and the transport.tcp.port is not directly addressable from the outside. Defaults to the actual port assigned via transport.tcp.port. |
| transport.bind_host | The host address to bind the transport service to. Defaults to transport.host (if set) or network.bind_host. |
| transport.publish_host | The host address to publish for nodes in the cluster to connect to. Defaults to transport.host (if set) or network.publish_host. |
| transport.host | Used to set the transport.bind_host and the transport.publish_host. |
| transport.tcp.connect_timeout | The socket connect timeout setting (in time setting format). Defaults to 30s. |
| transport.tcp.compress | Set to true to enable compression (LZF) between all nodes. Defaults to false (disabled). |
| transport.ping_schedule | Schedule a regular ping message to ensure that connections are kept alive. Defaults to 5s in the transport client and -1 (disabled) elsewhere. |
It also uses the common network settings.
TCP Transport Profiles
Elasticsearch allows you to bind to multiple ports on different interfaces through transport profiles. See this example configuration:
transport.profiles.default.port: 9300-9400
transport.profiles.default.bind_host: 10.0.0.1
transport.profiles.client.port: 9500-9600
transport.profiles.client.bind_host: 192.168.0.1
transport.profiles.dmz.port: 9700-9800
transport.profiles.dmz.bind_host: 172.16.1.2
The default profile is special: it is used as a fallback for any other profile that does not have a specific configuration setting set.
Note that the default profile is also how other nodes in the cluster will usually connect to this node. In the future this feature will allow enabling node-to-node communication via multiple interfaces.
The following parameters can be configured for each profile:
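The fallback behaviour can be sketched as a two-level lookup (a hypothetical helper in Python; the dmz profile here deliberately omits bind_host to show the fallback, unlike the configuration above):

```python
# A sketch of the fallback behaviour described above: a setting is
# looked up on the named profile first, then on the default profile.
profiles = {
    "default": {"port": "9300-9400", "bind_host": "10.0.0.1"},
    "client": {"port": "9500-9600", "bind_host": "192.168.0.1"},
    "dmz": {"port": "9700-9800"},  # no bind_host of its own
}

def profile_setting(profile, key):
    specific = profiles.get(profile, {})
    return specific.get(key, profiles["default"].get(key))

print(profile_setting("dmz", "port"))       # '9700-9800'
print(profile_setting("dmz", "bind_host"))  # '10.0.0.1' (from default)
```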
-
port: The port to bind to
-
bind_host: The host to bind to
-
publish_host: The host which is published in informational APIs
-
tcp_no_delay: Configures the TCP_NO_DELAY option for this socket
-
tcp_keep_alive: Configures the SO_KEEPALIVE option for this socket
-
reuse_address: Configures the SO_REUSEADDR option for this socket
-
tcp_send_buffer_size: Configures the send buffer size of the socket
-
tcp_receive_buffer_size: Configures the receive buffer size of the socket
Local Transport
This is a handy transport to use when running integration tests within
the JVM. It is automatically enabled when using
NodeBuilder#local(true).
Transport Tracer
The transport module has a dedicated tracer logger which, when activated, logs incoming and outgoing requests. The log can be dynamically activated
by setting the level of the transport.tracer logger to TRACE:
curl -XPUT localhost:9200/_cluster/settings -d '{
"transient" : {
"logger.transport.tracer" : "TRACE"
}
}'
You can also control which actions will be traced, using a set of include and exclude wildcard patterns. By default every request will be traced except for fault detection pings:
curl -XPUT localhost:9200/_cluster/settings -d '{
"transient" : {
"transport.tracer.include" : "*",
"transport.tracer.exclude" : "internal:discovery/zen/fd*"
}
}'
139. Tribe node
The tribes feature allows a tribe node to act as a federated client across multiple clusters.
The tribe node works by retrieving the cluster state from all connected clusters and merging them into a global cluster state. With this information at hand, it is able to perform read and write operations against the nodes in all clusters as if they were local. Note that a tribe node needs to be able to connect to each single node in every configured cluster.
The elasticsearch.yml config file for a tribe node just needs to list the
clusters that should be joined, for instance:
tribe:
t1:
cluster.name: cluster_one
t2:
cluster.name: cluster_two
t1 and t2 are arbitrary names representing the connection to each cluster.
The example above configures connections to two clusters, named t1 and t2
respectively. The tribe node will create a node client to
connect to each cluster, using unicast discovery by default. Any
other settings for the connection can be configured under tribe.{name}, just
like the cluster.name in the example.
The merged global cluster state means that almost all operations work in the same way as a single cluster: distributed search, suggest, percolation, indexing, etc.
However, there are a few exceptions:
-
The merged view cannot handle indices with the same name in multiple clusters. By default it will pick one of them; see the on_conflict setting below.
-
Master level read operations (eg Cluster State, Cluster Health) will automatically execute with a local flag set to true since there is no master.
-
Master level write operations (eg Create Index) are not allowed. These should be performed on a single cluster.
The tribe node can be configured to block all write operations and all metadata operations with:
tribe:
blocks:
write: true
metadata: true
The tribe node can also configure blocks on selected indices:
tribe:
blocks:
write.indices: hk*,ldn*
metadata.indices: hk*,ldn*
When there is a conflict and multiple clusters hold the same index, by default
the tribe node will pick one of them. This can be configured using the tribe.on_conflict
setting. It defaults to any, but can be set to drop (drop indices that have
a conflict), or prefer_[tribeName] to prefer the index from a specific tribe.
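The on_conflict choices can be sketched as follows (a hypothetical helper in Python, not Elasticsearch source, for an index name that exists in several clusters):

```python
# A sketch of the tribe.on_conflict behaviour described above.
def resolve_conflict(tribes_with_index, on_conflict="any"):
    if on_conflict == "any":
        return tribes_with_index[0]  # pick one of them
    if on_conflict == "drop":
        return None                  # drop the conflicting index
    if on_conflict.startswith("prefer_"):
        preferred = on_conflict[len("prefer_"):]
        return preferred if preferred in tribes_with_index else None
    raise ValueError("unknown on_conflict value: %s" % on_conflict)

print(resolve_conflict(["t1", "t2"], "prefer_t2"))  # 't2'
print(resolve_conflict(["t1", "t2"], "drop"))       # None
```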
Tribe node settings
The tribe node starts a node client for each listed cluster. The following configuration options are passed down from the tribe node to each node client:
-
node.name (used to derive the node.name for each node client) -
network.host -
network.bind_host -
network.publish_host -
transport.host -
transport.bind_host -
transport.publish_host -
path.home -
path.conf -
path.plugins -
path.logs -
path.scripts -
shield.*
Almost any setting (except for path.*) may be configured at the node client
level itself, in which case it will override any setting passed through from
the tribe node. Settings you may want to set at the node client level
include:
-
network.host -
network.bind_host -
network.publish_host -
transport.host -
transport.bind_host -
transport.publish_host -
cluster.name -
discovery.zen.ping.unicast.hosts
path.scripts: some/path/to/config
network.host: 192.168.1.5
tribe:
t1:
cluster.name: cluster_one
t2:
cluster.name: cluster_two
network.host: 10.1.2.3
The path.scripts setting is inherited by both t1 and t2.
The top-level network.host setting is inherited by t1.
The t2 node client overrides the network.host setting inherited from the tribe node.
Index Modules
Index Modules are modules created per index and control all aspects related to an index.
Index Settings
Index level settings can be set per-index. Settings may be:
- static
-
They can only be set at index creation time or on a closed index.
- dynamic
-
They can be changed on a live index using the update-index-settings API.
Warning: Changing static or dynamic index settings on a closed index could result in incorrect settings that are impossible to rectify without deleting and recreating the index.
Static index settings
Below is a list of all static index settings that are not associated with any specific index module:
index.number_of_shards-
The number of primary shards that an index should have. Defaults to 5. This setting can only be set at index creation time. It cannot be changed on a closed index.
index.shard.check_on_startup
(experimental) Whether or not shards should be checked for corruption before opening. When corruption is detected, it will prevent the shard from being opened. Accepts:
false-
(default) Don’t check for corruption when opening a shard.
checksum-
Check for physical corruption.
true-
Check for both physical and logical corruption. This is much more expensive in terms of CPU and memory usage.
fix-
Check for both physical and logical corruption. Segments that were reported as corrupted will be automatically removed. This option may result in data loss. Use with extreme caution!
Checking shards may take a lot of time on large indices.
index.codec-
The default value compresses stored data with LZ4 compression, but this can be set to best_compression which uses DEFLATE for a higher compression ratio, at the expense of slower stored fields performance.
Dynamic index settings
Below is a list of all dynamic index settings that are not associated with any specific index module:
index.number_of_replicas-
The number of replicas each primary shard has. Defaults to 1.
index.auto_expand_replicas-
Auto-expand the number of replicas based on the number of available nodes. Set to a dash delimited lower and upper bound (e.g. 0-5) or use all for the upper bound (e.g. 0-all). Defaults to false (i.e. disabled).
index.refresh_interval-
How often to perform a refresh operation, which makes recent changes to the index visible to search. Defaults to 1s. Can be set to -1 to disable refresh.
index.max_result_window-
The maximum value of from + size for searches to this index. Defaults to 10000. Search requests take heap memory and time proportional to from + size and this limits that memory. See Scroll for a more efficient alternative to raising this.
index.blocks.read_only-
Set to true to make the index and index metadata read only, false to allow writes and metadata changes.
index.blocks.read-
Set to true to disable read operations against the index.
index.blocks.write-
Set to true to disable write operations against the index.
index.blocks.metadata-
Set to true to disable index metadata reads and writes.
index.ttl.disable_purge-
(experimental) Disables the purge of expired docs on the current index.
index.recovery.initial_shards-
A primary shard is only recovered if there are enough nodes available to allocate sufficient replicas to form a quorum. It can be set to:
-
quorum (default)
-
quorum-1 (or half)
-
full
-
full-1
-
Number values are also supported, e.g. 1.
Settings in other index modules
Other index settings are available in index modules:
- Analysis
-
Settings to define analyzers, tokenizers, token filters and character filters.
- Index shard allocation
-
Control over where, when, and how shards are allocated to nodes.
- Mapping
-
Enable or disable dynamic mapping for an index.
- Merging
-
Control over how shards are merged by the background merge process.
- Similarities
-
Configure custom similarity settings to customize how search results are scored.
- Slowlog
-
Control over how slow queries and fetch requests are logged.
- Store
-
Configure the type of filesystem used to access shard data.
- Translog
-
Control over the transaction log and background flush operations.
140. Analysis
The index analysis module acts as a configurable registry of analyzers that can be used in order to convert a string field into individual terms which are:
-
added to the inverted index in order to make the document searchable
-
used by high level queries such as the match query to generate search terms.
See Analysis for configuration details.
141. Index Shard Allocation
This module provides per-index settings to control the allocation of shards to nodes:
-
Shard allocation filtering: Controlling which shards are allocated to which nodes.
-
Delayed allocation: Delaying allocation of unassigned shards caused by a node leaving.
-
Total shards per node: A hard limit on the number of shards from the same index per node.
141.1. Shard Allocation Filtering
Shard allocation filtering allows you to specify which nodes are allowed to host the shards of a particular index.
Note: The per-index shard allocation filters explained below work in conjunction with the cluster-wide allocation filters explained in Cluster Level Shard Allocation.
It is possible to assign arbitrary metadata attributes to each node at
startup. For instance, nodes could be assigned a rack and a group
attribute as follows:
bin/elasticsearch --node.rack rack1 --node.size big
These attribute settings can also be specified in the elasticsearch.yml config file.
These metadata attributes can be used with the
index.routing.allocation.* settings to allocate an index to a particular
group of nodes. For instance, we can move the index test to either big or
medium nodes as follows:
PUT test/_settings
{
"index.routing.allocation.include.size": "big,medium"
}
Alternatively, we can move the index test away from the small nodes with
an exclude rule:
PUT test/_settings
{
"index.routing.allocation.exclude.size": "small"
}
Multiple rules can be specified, in which case all conditions must be
satisfied. For instance, we could move the index test to big nodes in
rack1 with the following:
PUT test/_settings
{
"index.routing.allocation.include.size": "big",
"index.routing.allocation.include.rack": "rack1"
}
Note: If some conditions cannot be satisfied then shards will not be moved.
The following settings are dynamic, allowing live indices to be moved from one set of nodes to another:
index.routing.allocation.include.{attribute}-
Assign the index to a node whose {attribute} has at least one of the comma-separated values.
index.routing.allocation.require.{attribute}-
Assign the index to a node whose {attribute} has all of the comma-separated values.
index.routing.allocation.exclude.{attribute}-
Assign the index to a node whose {attribute} has none of the comma-separated values.
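The three filter flavours can be sketched as set operations (a hypothetical helper in Python; node_values is the set of values a node has for one attribute):

```python
# A sketch of the include/require/exclude semantics described above,
# for a single {attribute}.
def allocation_allowed(node_values, include=None, require=None, exclude=None):
    if include and not (set(include.split(",")) & node_values):
        return False  # include: at least one value must match
    if require and not set(require.split(",")) <= node_values:
        return False  # require: all values must match
    if exclude and set(exclude.split(",")) & node_values:
        return False  # exclude: no value may match
    return True

print(allocation_allowed({"big"}, include="big,medium"))  # True
print(allocation_allowed({"small"}, exclude="small"))     # False
```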
These special attributes are also supported:
| Attribute | Description |
|---|---|
| _name | Match nodes by node name |
| _host_ip | Match nodes by host IP address (IP associated with hostname) |
| _publish_ip | Match nodes by publish IP address |
| _ip | Match either _host_ip or _publish_ip |
| _host | Match nodes by hostname |
All attribute values can be specified with wildcards, e.g.:
PUT test/_settings
{
"index.routing.allocation.include._ip": "192.168.2.*"
}
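The wildcard matching above behaves like a glob pattern over the attribute value (a sketch in Python with hypothetical node addresses):

```python
from fnmatch import fnmatch

# A sketch of wildcard matching against node IPs, as in the
# include._ip example above (hypothetical node addresses).
nodes = {"node-1": "192.168.2.10", "node-2": "192.168.3.7"}
pattern = "192.168.2.*"

eligible = [name for name, ip in nodes.items() if fnmatch(ip, pattern)]
print(eligible)  # ['node-1']
```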
141.2. Delaying allocation when a node leaves
When a node leaves the cluster for whatever reason, intentional or otherwise, the master reacts by:
-
Promoting a replica shard to primary to replace any primaries that were on the node.
-
Allocating replica shards to replace the missing replicas (assuming there are enough nodes).
-
Rebalancing shards evenly across the remaining nodes.
These actions are intended to protect the cluster against data loss by ensuring that every shard is fully replicated as soon as possible.
Even though we throttle concurrent recoveries both at the node level and at the cluster level, this “shard-shuffle” can still put a lot of extra load on the cluster which may not be necessary if the missing node is likely to return soon. Imagine this scenario:
-
Node 5 loses network connectivity.
-
The master promotes a replica shard to primary for each primary that was on Node 5.
-
The master allocates new replicas to other nodes in the cluster.
-
Each new replica makes an entire copy of the primary shard across the network.
-
More shards are moved to different nodes to rebalance the cluster.
-
Node 5 returns after a few minutes.
-
The master rebalances the cluster by allocating shards to Node 5.
If the master had just waited for a few minutes, then the missing shards could have been re-allocated to Node 5 with the minimum of network traffic. This process would be even quicker for idle shards (shards not receiving indexing requests) which have been automatically sync-flushed.
The allocation of replica shards which become unassigned because a node has
left can be delayed with the index.unassigned.node_left.delayed_timeout
dynamic setting, which defaults to 1m.
This setting can be updated on a live index (or on all indices):
PUT /_all/_settings
{
"settings": {
"index.unassigned.node_left.delayed_timeout": "5m"
}
}
With delayed allocation enabled, the above scenario changes to look like this:
-
Node 5 loses network connectivity.
-
The master promotes a replica shard to primary for each primary that was on Node 5.
-
The master logs a message that allocation of unassigned shards has been delayed, and for how long.
-
The cluster remains yellow because there are unassigned replica shards.
-
Node 5 returns after a few minutes, before the timeout expires.
-
The missing replicas are re-allocated to Node 5 (and sync-flushed shards recover almost immediately).
Note: This setting will not affect the promotion of replicas to primaries, nor will it affect the assignment of replicas that have not been assigned previously. In particular, delayed allocation does not come into effect after a full cluster restart. Also, in case of a master failover situation, elapsed delay time is forgotten (i.e. reset to the full initial delay).
141.2.1. Cancellation of shard relocation
If delayed allocation times out, the master assigns the missing shards to another node which will start recovery. If the missing node rejoins the cluster, and its shards still have the same sync-id as the primary, shard relocation will be cancelled and the synced shard will be used for recovery instead.
For this reason, the default timeout is set to just one minute: even if shard
relocation begins, cancelling recovery in favour of the synced shard is cheap.
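The cancellation decision described above can be sketched in a few lines. This is an illustrative model only (the function and return values are hypothetical, not Elasticsearch's actual recovery API): if the rejoined node's shard copy still carries the same sync-id as the primary, the in-flight relocation is cancelled in favour of the synced local copy.

```python
def choose_recovery_source(rejoined_shard_sync_id, primary_sync_id, relocation_started):
    # If the returning node's copy has the primary's sync-id, reuse it
    # and cancel any relocation that has already begun.
    if rejoined_shard_sync_id is not None and rejoined_shard_sync_id == primary_sync_id:
        return "cancel_relocation_use_synced_copy"
    # Otherwise either let the started relocation finish, or do a full recovery.
    return "continue_relocation" if relocation_started else "full_recovery"
```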
141.2.2. Monitoring delayed unassigned shards
The number of shards whose allocation has been delayed by this timeout setting can be viewed with the cluster health API:
GET _cluster/health
This request will return a delayed_unassigned_shards value.
141.2.3. Removing a node permanently
If a node is not going to return and you would like Elasticsearch to allocate the missing shards immediately, just update the timeout to zero:
PUT /_all/_settings
{
"settings": {
"index.unassigned.node_left.delayed_timeout": "0"
}
}
You can reset the timeout as soon as the missing shards have started to recover.
141.3. Index recovery prioritization
Unallocated shards are recovered in order of priority, whenever possible. Indices are sorted into priority order as follows:
-
the optional index.priority setting (higher before lower)
-
the index creation date (higher before lower)
-
the index name (higher before lower)
This means that, by default, newer indices will be recovered before older indices.
Use the per-index dynamically updateable index.priority setting to customise
the index prioritization order. For instance:
PUT index_1
PUT index_2
PUT index_3
{
"settings": {
"index.priority": 10
}
}
PUT index_4
{
"settings": {
"index.priority": 5
}
}
In the above example:
-
index_3 will be recovered first because it has the highest index.priority.
-
index_4 will be recovered next because it has the next highest priority.
-
index_2 will be recovered next because it was created more recently.
-
index_1 will be recovered last.
This setting accepts an integer, and can be updated on a live index with the update index settings API:
PUT index_4/_settings
{
"index.priority": 1
}
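The ordering rules above can be sketched with a tuple-based sort key. This is an illustrative model of the priority rules, not the allocator's actual code:

```python
def recovery_order(indices):
    # indices: list of (name, priority, creation_date_millis) tuples.
    # Sort by priority, then creation date, then name - each "higher before lower".
    return [name for name, priority, created in
            sorted(indices, key=lambda i: (i[1], i[2], i[0]), reverse=True)]

# Mirrors the example above: index_3 has priority 10, index_4 has 5,
# the others keep the default and tie-break on creation date.
indices = [
    ("index_1", 1, 100),
    ("index_2", 1, 200),
    ("index_3", 10, 300),
    ("index_4", 5, 400),
]
# recovery_order(indices) -> ["index_3", "index_4", "index_2", "index_1"]
```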
141.4. Total Shards Per Node
The cluster-level shard allocator tries to spread the shards of a single index across as many nodes as possible. However, depending on how many shards and indices you have, and how big they are, it may not always be possible to spread shards evenly.
The following dynamic setting allows you to specify a hard limit on the total number of shards from a single index allowed per node:
index.routing.allocation.total_shards_per_node
-
The maximum number of shards (replicas and primaries) that will be allocated to a single node. Defaults to unbounded.
You can also limit the amount of shards a node can have regardless of the index:
cluster.routing.allocation.total_shards_per_node
-
The maximum number of shards (replicas and primaries) that will be allocated to a single node globally. Defaults to unbounded (-1).
Note: These settings impose a hard limit which can result in some shards not being allocated. Use with caution.
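The hard-limit check can be sketched as follows (a hypothetical helper for illustration, not the actual allocation decider):

```python
def can_allocate(index_shards_on_node, total_shards_on_node,
                 index_limit=-1, cluster_limit=-1):
    # -1 means unbounded, mirroring the defaults described above.
    if index_limit != -1 and index_shards_on_node >= index_limit:
        return False
    if cluster_limit != -1 and total_shards_on_node >= cluster_limit:
        return False
    return True
```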
142. Mapper
The mapper module acts as a registry for the type mapping definitions added to an index either when creating it or by using the put mapping api. It also handles the dynamic mapping support for types that have no explicit mappings predefined. For more information about mapping definitions, check out the mapping section.
143. Merge
A shard in elasticsearch is a Lucene index, and a Lucene index is broken down into segments. Segments are internal storage elements in the index where the index data is stored, and are immutable. Smaller segments are periodically merged into larger segments to keep the index size at bay and to expunge deletes.
The merge process uses auto-throttling to balance the use of hardware resources between merging and other activities like search.
Merge scheduling
The merge scheduler (ConcurrentMergeScheduler) controls the execution of merge operations when they are needed. Merges run in separate threads, and when the maximum number of threads is reached, further merges will wait until a merge thread becomes available.
The merge scheduler supports the following dynamic setting:
index.merge.scheduler.max_thread_count
-
The maximum number of threads that may be merging at once. Defaults to Math.max(1, Math.min(4, Runtime.getRuntime().availableProcessors() / 2)), which works well for a good solid-state disk (SSD). If your index is on spinning platter drives instead, decrease this to 1.
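The Java default above translates directly to Python, which makes the behaviour easy to see for different core counts:

```python
def default_max_merge_threads(available_processors):
    # Python rendering of the Java default:
    # Math.max(1, Math.min(4, availableProcessors / 2))
    return max(1, min(4, available_processors // 2))
```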
144. Similarity module
A similarity (scoring / ranking model) defines how matching documents are scored. Similarity is per field, meaning that via the mapping one can define a different similarity per field.
Configuring a custom similarity is considered an expert feature and the
builtin similarities are most likely sufficient, as described in
similarity.
Configuring a similarity
Most existing or custom similarities have configuration options which can be set via the index settings, as shown below. The index options can be provided when creating an index or when updating index settings.
"similarity" : {
"my_similarity" : {
"type" : "DFR",
"basic_model" : "g",
"after_effect" : "l",
"normalization" : "h2",
"normalization.h2.c" : "3.0"
}
}
Here we configure the DFRSimilarity so it can be referenced as
my_similarity in mappings, as illustrated in the example below:
{
"book" : {
"properties" : {
"title" : { "type" : "string", "similarity" : "my_similarity" }
}
}
}
Available similarities
Default similarity
The default similarity that is based on the TF/IDF model. This similarity has the following option:
discount_overlaps
-
Determines whether overlap tokens (tokens with 0 position increment) are ignored when computing norm. By default this is true, meaning overlap tokens do not count when computing norms.
Type name: default
BM25 similarity
Another TF/IDF based similarity that has built-in tf normalization and is supposed to work better for short fields (like names). See Okapi_BM25 for more details. This similarity has the following options:
k1
-
Controls non-linear term frequency normalization (saturation).
b
-
Controls to what degree document length normalizes tf values.
discount_overlaps
-
Determines whether overlap tokens (tokens with 0 position increment) are ignored when computing norm. By default this is true, meaning overlap tokens do not count when computing norms.
Type name: BM25
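For intuition about k1 and b, here is the textbook Okapi BM25 per-term score. This is a sketch of the model only; Lucene's implementation adds its own norm encoding and boosts, so the numbers will not match Elasticsearch scores exactly:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    # idf grows for rarer terms; the k1/b part saturates term frequency
    # (k1) and normalizes by document length (b).
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
```

With b=0 document length has no effect; with larger b, long documents are penalized more.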
DFR similarity
Similarity that implements the divergence from randomness framework. This similarity has the following options:
basic_model
-
Possible values: be, d, g, if, in and ine.
after_effect
-
Possible values: no, b and l.
normalization
-
Possible values: no, h1, h2, h3 and z.
All options but the first option need a normalization value.
Type name: DFR
DFI similarity
Similarity that implements the divergence from independence model. This similarity has the following options:
independence_measure
-
Possible values: standardized, saturated, chisquared.
IB similarity.
Information based model . The algorithm is based on the concept that the information content in any symbolic distribution sequence is primarily determined by the repetitive usage of its basic elements. For written texts this challenge would correspond to comparing the writing styles of different authors. This similarity has the following options:
distribution
-
Possible values: ll and spl.
lambda
-
Possible values: df and ttf.
normalization
-
Same as in DFR similarity.
Type name: IB
LM Dirichlet similarity.
LM Dirichlet similarity. This similarity has the following options:
mu
-
Defaults to 2000.
Type name: LMDirichlet
LM Jelinek Mercer similarity.
LM Jelinek Mercer similarity. The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:
lambda
-
The optimal value depends on both the collection and the query. The optimal value is around 0.1 for title queries and 0.7 for long queries. Defaults to 0.1.
Type name: LMJelinekMercer
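For intuition, Jelinek-Mercer smoothing linearly interpolates the document language model with the collection model using lambda. This is a sketch of the smoothing formula only, not Lucene's exact scoring code:

```python
def jelinek_mercer_probability(tf, doc_len, collection_tf, collection_len, lam=0.1):
    # Linear interpolation: document model weighted by (1 - lambda),
    # collection model weighted by lambda.
    return (1 - lam) * (tf / doc_len) + lam * (collection_tf / collection_len)
```

A small lambda trusts the document itself; a large lambda smooths heavily toward collection-wide statistics.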
Default and Base Similarities
By default, Elasticsearch will use whatever similarity is configured as
default. However, the similarity functions queryNorm() and coord()
are not per-field. Consequently, for expert users wanting to change the
implementation used for these two methods, while not changing the
default, it is possible to configure a similarity with the name
base. This similarity will then be used for the two methods.
You can change the default similarity for all fields by putting the following setting into elasticsearch.yml:
index.similarity.default.type: BM25
145. Slow Log
Search Slow Log
The shard-level slow search log allows slow searches (query and fetch phases) to be logged into a dedicated log file.
Thresholds can be set for both the query phase and the fetch phase of execution. Here is a sample:
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms
All of the above settings are dynamic and can be set per-index.
By default, none are enabled (set to -1). The levels (warn, info,
debug, trace) allow you to control at which logging level the log
will be written. Not all levels need to be configured (for example, only
the warn threshold can be set). The benefit of several levels is the
ability to quickly "grep" for specific breached thresholds.
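The threshold mechanics can be sketched as follows (the helper name and dict shape are illustrative, not the actual implementation): the most severe breached threshold determines the level at which the entry is logged, and -1 disables a level.

```python
def slowlog_level(took_millis, thresholds):
    # thresholds maps a level name to millis; -1 disables that level.
    # Check from most to least severe; first breached threshold wins.
    for level in ("warn", "info", "debug", "trace"):
        threshold = thresholds.get(level, -1)
        if threshold >= 0 and took_millis > threshold:
            return level
    return None

# Mirrors the sample query-phase thresholds above.
query_thresholds = {"warn": 10_000, "info": 5_000, "debug": 2_000, "trace": 500}
```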
The logging is done at the shard level, meaning the execution of a search request within a specific shard. It does not encompass the whole search request, which may be broadcast to several shards for execution. One of the benefits of shard-level logging is that it associates the actual execution with a specific machine, which request-level logging cannot do.
The logging file is configured by default using the following
configuration (found in logging.yml):
index_search_slow_log_file:
type: dailyRollingFile
file: ${path.logs}/${cluster.name}_index_search_slowlog.log
datePattern: "'.'yyyy-MM-dd"
layout:
type: pattern
conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"
Index Slow log
The indexing slow log is similar in functionality to the search slow
log. The log file name ends with _index_indexing_slowlog.log, and
the thresholds are configured in the elasticsearch.yml file in the same
way as the search slowlog. Index slowlog sample:
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms
index.indexing.slowlog.level: info
index.indexing.slowlog.source: 1000
All of the above settings are dynamic and can be set per-index.
By default Elasticsearch will log the first 1000 characters of the _source in
the slowlog. You can change that with index.indexing.slowlog.source. Setting
it to false or 0 will skip logging the source entirely, and setting it to
true will log the entire source regardless of size.
The index slow log file is configured by default in the logging.yml
file:
index_indexing_slow_log_file:
type: dailyRollingFile
file: ${path.logs}/${cluster.name}_index_indexing_slowlog.log
datePattern: "'.'yyyy-MM-dd"
layout:
type: pattern
conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"
146. Store
The store module allows you to control how index data is stored and accessed on disk.
File system storage types
There are different file system implementations or storage types. The best
one for the operating environment will be automatically chosen: mmapfs on
Windows 64bit, simplefs on Windows 32bit, and default (hybrid niofs and
mmapfs) for the rest.
This can be overridden for all indices by adding this to the
config/elasticsearch.yml file:
index.store.type: niofs
It is a static setting that can be set on a per-index basis at index creation time:
PUT /my_index
{
"settings": {
"index.store.type": "niofs"
}
}
Experimental: This is an expert-only setting and may be removed in the future.
The following section lists all the different storage types supported.
simplefs
-
The Simple FS type is a straightforward implementation of file system storage (maps to Lucene SimpleFsDirectory) using a random access file. This implementation has poor concurrent performance (multiple threads will bottleneck). It is usually better to use niofs when you need index persistence.
niofs
-
The NIO FS type stores the shard index on the file system (maps to Lucene NIOFSDirectory) using NIO. It allows multiple threads to read from the same file concurrently. It is not recommended on Windows because of a bug in the SUN Java implementation.
mmapfs
-
The MMap FS type stores the shard index on the file system (maps to Lucene MMapDirectory) by mapping a file into memory (mmap). Memory mapping uses up a portion of the virtual memory address space in your process equal to the size of the file being mapped. Before using this class, be sure you have allowed plenty of virtual address space.
default_fs
-
The default type is a hybrid of NIO FS and MMapFS, which chooses the best file system for each type of file. Currently only the Lucene term dictionary and doc values files are memory mapped to reduce the impact on the operating system. All other files are opened using Lucene NIOFSDirectory. Address space settings (virtual memory) might also apply if your term dictionaries are large.
147. Translog
Changes to Lucene are only persisted to disk during a Lucene commit, which is a relatively heavy operation and so cannot be performed after every index or delete operation. Changes that happen after one commit and before another will be lost in the event of process exit or hardware failure.
To prevent this data loss, each shard has a transaction log or write ahead log associated with it. Any index or delete operation is written to the translog after being processed by the internal Lucene index.
In the event of a crash, recent transactions can be replayed from the transaction log when the shard recovers.
An Elasticsearch flush is the process of performing a Lucene commit and starting a new translog. It is done automatically in the background in order to make sure the transaction log doesn’t grow too large, which would make replaying its operations take a considerable amount of time during recovery. It is also exposed through an API, though it rarely needs to be performed manually.
Flush settings
The following dynamically updatable settings control how often the in-memory buffer is flushed to disk:
index.translog.flush_threshold_size
-
Once the translog hits this size, a flush will happen. Defaults to 512mb.
index.translog.flush_threshold_ops
-
After how many operations to flush. Defaults to unlimited.
index.translog.flush_threshold_period
-
How long to wait before triggering a flush regardless of translog size. Defaults to 30m.
index.translog.interval
-
How often to check if a flush is needed, randomized between the interval value and 2x the interval value. Defaults to 5s.
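Putting the size and period thresholds together, the check that runs at each interval tick looks roughly like this (an illustrative model of the defaults above, not the actual implementation):

```python
def should_flush(translog_size_bytes, age_minutes,
                 size_threshold_bytes=512 * 1024 * 1024,
                 period_minutes=30):
    # A flush triggers once the translog exceeds flush_threshold_size,
    # or once flush_threshold_period elapses regardless of size.
    return translog_size_bytes >= size_threshold_bytes or age_minutes >= period_minutes
```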
Translog settings
The data in the transaction log is only persisted to disk when the translog is
fsynced and committed. In the event of hardware failure, any data written
since the previous translog commit will be lost.
By default, Elasticsearch commits the translog at the end of every index, delete,
update, or bulk request. In fact, Elasticsearch
will only report success of an index, delete, update, or bulk request to the
client after the transaction log has been successfully fsynced and committed
on the primary and on every allocated replica.
The following dynamically updatable per-index settings control the behaviour of the transaction log:
index.translog.sync_interval-
How often the translog is
fsynced to disk and committed, regardless of write operations. Defaults to5s. index.translog.durability-
Whether or not to
fsyncand commit the translog after every index, delete, update, or bulk request. This setting accepts the following parameters:request-
(default)
fsyncand commit after every request. In the event of hardware failure, all acknowledged writes will already have been committed to disk. async-
fsyncand commit in the background everysync_interval. In the event of hardware failure, all acknowledged writes since the last automatic commit will be discarded.
index.translog.fs.type-
Whether to buffer writes to the transaction log in memory or not. This setting accepts the following parameters:
buffered-
(default) Translog writes first go to a 64kB buffer in memory, and are only written to the disk when the buffer is full, or when an
fsyncis triggered by a write request or thesync_interval. simple-
Translog writes are written to the file system immediately, without buffering. However, these writes will only be persisted to disk when an
fsyncand commit is triggered by a write request or thesync_interval.
Testing
This section is about utilizing elasticsearch as part of your testing infrastructure.
148. Java Testing Framework
Testing is a crucial part of your application, and as information retrieval itself is already a complex topic, there should not be any additional complexity in setting up a testing infrastructure that uses elasticsearch. This is the main reason why we decided to release an additional file with the release, which allows you to use the same testing infrastructure we do in the elasticsearch core. The testing framework allows you to set up clusters with multiple nodes in order to check that your code covers everything needed to run in a cluster. The framework saves you from writing complex code yourself to start, stop, or manage several test nodes in a cluster. In addition, there is another very important feature called randomized testing, which you get for free as it is part of the elasticsearch infrastructure.
148.1. why randomized testing?
The key concept of randomized testing is not to use the same input values for every test case, but still be able to reproduce them in case of a failure. This allows you to test with vastly different input variables in order to make sure that your implementation is actually independent of the provided test data.
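The reproducibility idea can be illustrated in a few lines of Python. The elasticsearch framework does this in Java via RandomizedRunner; the function below is purely illustrative: a fixed seed makes "random" input fully reproducible, which is what lets a failed randomized test be replayed exactly.

```python
import random

def generate_test_input(seed, size=5):
    # Seeding the RNG means every run with the same seed produces
    # the same sequence of "random" test data.
    rng = random.Random(seed)
    return [rng.randint(0, 1000) for _ in range(size)]
```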
All of the tests are run using a custom junit runner, the RandomizedRunner provided by the randomized-testing project. If you are interested in the implementation being used, check out the RandomizedTesting webpage.
148.2. Using the elasticsearch test classes
First, you need to include the testing dependency in your project, along with the elasticsearch dependency you have already added. If you use maven and its pom.xml file, it looks like this:
<dependencies>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-test-framework</artifactId>
<version>${lucene.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch</artifactId>
<version>${elasticsearch.version}</version>
<scope>test</scope>
<type>test-jar</type>
</dependency>
</dependencies>
Replace the elasticsearch version and the lucene version with the corresponding elasticsearch version and its accompanying lucene release.
We provide a few classes that you can inherit from in your own test classes which provide:
-
pre-defined loggers
-
randomized testing infrastructure
-
a number of helper methods
148.3. unit tests
If your test is a well-isolated unit test which doesn’t need a running elasticsearch cluster, you can use ESTestCase. If you are testing lucene features, use ESTestCase, and if you are testing concrete token streams, use the ESTokenStreamTestCase class. Those specific classes execute additional checks which ensure that no resource leaks happen after the test has run.
148.4. integration tests
These kinds of tests require firing up a whole cluster of nodes before the tests can actually be run. Compared to unit tests they are obviously far more time consuming, but the test infrastructure tries to minimize the time cost by only restarting the whole cluster if this is configured explicitly.
The class your tests have to inherit from is ESIntegTestCase. By inheriting from this class, you will no longer need to start elasticsearch nodes manually in your test, although you might need to ensure that at least a certain number of nodes are up. The integration test behaviour can be configured heavily by specifying different system properties on test runs. See the TESTING.asciidoc documentation in the source repository for more information.
148.4.1. number of shards
The number of shards used for indices created during integration tests is randomized between 1 and 10 unless overwritten upon index creation via index settings.
The rule of thumb is not to specify the number of shards unless needed, so that each test will use a different one all the time. Alternatively you can override the numberOfShards() method. The same applies to the numberOfReplicas() method.
148.4.2. generic helper methods
There are a couple of helper methods in ESIntegTestCase, which will make your tests shorter and more concise.
refresh()
-
Refreshes all indices in a cluster
ensureGreen()
-
Ensures a green health cluster state, waiting for relocations. Waits the default timeout of 30 seconds before failing.
ensureYellow()
-
Ensures a yellow health cluster state, also waits for 30 seconds before failing.
createIndex(name)
-
Creates an index with the specified name
flush()
-
Flushes all indices in a cluster
flushAndRefresh()
-
Combines flush() and refresh()
forceMerge()
-
Waits for all relocations and force merges all indices in the cluster to one segment.
indexExists(name)
-
Checks if the given index exists
admin()
-
Returns an AdminClient
clusterService()
-
Returns the cluster service java class
cluster()
-
Returns the test cluster class, which is explained in the next paragraphs
148.4.3. test cluster methods
The InternalTestCluster class is the heart of the cluster functionality in a randomized test and allows you to configure a specific setting or replay certain types of outages to check how your custom code reacts.
ensureAtLeastNumNodes(n)
-
Ensures at least the specified number of nodes is running in the cluster
ensureAtMostNumNodes(n)
-
Ensures at most the specified number of nodes is running in the cluster
getInstance()
-
Gets a guice instantiated instance of a class from a random node
getInstanceFromNode()
-
Gets a guice instantiated instance of a class from a specified node
stopRandomNode()
-
Stops a random node in your cluster to mimic an outage
stopCurrentMasterNode()
-
Stops the current master node to force a new election
stopRandomNonMaster()
-
Stops a random non master node to mimic an outage
buildNode()
-
Creates a new elasticsearch node
startNode(settings)
-
Creates and starts a new elasticsearch node
148.4.4. Changing node settings
If you want to ensure a certain configuration for the nodes which are started as part of the EsIntegTestCase, you can override the nodeSettings() method:
public class Mytests extends ESIntegTestCase {
@Override
protected Settings nodeSettings(int nodeOrdinal) {
return Settings.builder().put(super.nodeSettings(nodeOrdinal))
.put("node.mode", "network")
.build();
}
}
148.4.5. Accessing clients
In order to execute any actions, you have to use a client. You can use the ESIntegTestCase.client() method to get back a random client. This client can be a TransportClient or a NodeClient - and usually you do not need to care as long as the action gets executed. There are several more methods for client selection inside of the InternalTestCluster class, which can be accessed using the ESIntegTestCase.internalCluster() method.
iterator()
-
An iterator over all available clients
masterClient()
-
Returns a client which is connected to the master node
nonMasterClient()
-
Returns a client which is not connected to the master node
clientNodeClient()
-
Returns a client which is running on a client node
client(String nodeName)
-
Returns a client to a given node
smartClient()
-
Returns a smart client
148.4.6. Scoping
By default the tests are run with a unique cluster per test suite. Of course all indices and templates are deleted between each test. However, sometimes you need to start a new cluster for each test, for example if you load a certain plugin but do not want to load it for every test.
You can use the @ClusterScope annotation at class level to configure this behaviour:
@ClusterScope(scope=TEST, numNodes=1)
public class CustomSuggesterSearchTests extends ESIntegTestCase {
// ... tests go here
}
The above sample configures the test to use a new cluster for each test method. The default scope is SUITE (one cluster for all test methods in the test). The numNodes setting allows you to start only a certain number of nodes, which can speed up test execution, as starting a new node is a costly and time-consuming operation and might not be needed for the test.
148.4.7. Changing plugins via configuration
As elasticsearch uses JUnit 4, using the @Before and @After annotations is not a problem. However, you should keep in mind that these do not have any effect on your cluster setup, as the cluster is already up and running when those methods are run. So if you want to configure settings - like loading a plugin on node startup - before the node is actually running, you should override the nodePlugins() method from the ESIntegTestCase class and return the plugin classes each node should load.
@Override
protected Collection<Class<? extends Plugin>> nodePlugins() {
return pluginList(CustomSuggesterPlugin.class);
}
148.5. Randomized testing
The code snippets you have seen so far do not show any trace of randomized testing features, as they are carefully hidden under the hood. However, when you are writing your own tests, you should make use of these features as well. Before starting with that, you should know how to repeat a failed test with the same setup that made it fail. Luckily this is quite easy, as the whole mvn call is logged together with failed tests, which means you can simply copy and paste that line and re-run the test.
148.5.1. Generating random data
The next step is to convert your test using static test data into a test using randomized test data. The kind of data you could randomize varies a lot with the functionality you are testing against. Take a look at the following examples (note that this list could go on for pages, as a distributed system has many, many moving parts):
-
Searching for data using arbitrary UTF8 signs
-
Changing your mapping configuration, index and field names with each run
-
Changing your response sizes/configurable limits with each run
-
Changing the number of shards/replicas when creating an index
So, how can you create random data? The most important thing to know is that you should never instantiate your own Random instance, but use the one provided by RandomizedTest, from which all elasticsearch-dependent test classes inherit.
getRandom()
-
Returns the random instance, which can be recreated when calling the test with specific parameters
randomBoolean()
-
Returns a random boolean
randomByte()
-
Returns a random byte
randomShort()
-
Returns a random short
randomInt()
-
Returns a random integer
randomLong()
-
Returns a random long
randomFloat()
-
Returns a random float
randomDouble()
-
Returns a random double
randomInt(max)
-
Returns a random integer between 0 and max
between()
-
Returns a random integer within the supplied range
atLeast()
-
Returns a random integer of at least the specified integer
atMost()
-
Returns a random integer of at most the specified integer
randomLocale()
-
Returns a random locale
randomTimeZone()
-
Returns a random timezone
randomFrom()
-
Returns a random element from a list/array
In addition, there are a couple of helper methods allowing you to create random ASCII and Unicode strings; see the methods beginning with randomAscii, randomUnicode, and randomRealisticUnicode in the random test class. The latter tries to create more realistic unicode strings by not being arbitrarily random.
If you want to debug a specific problem with a specific random seed, you can use the @Seed annotation to configure a specific seed for a test. If you want to run a test more than once, instead of starting the whole test suite over and over again, you can use the @Repeat annotation with an arbitrary value. Each iteration then gets run with a different seed.
148.6. Assertions
As many elasticsearch tests check for similar output, like the number of hits, the first hit, or special highlighting, a couple of predefined assertions have been created. Those have been put into the ElasticsearchAssertions class. There are also specific geo assertions in ElasticsearchGeoAssertions.
assertHitCount()
-
Checks the hit count of a search or count request
assertAcked()
-
Ensures that a request has been acknowledged by the master
assertSearchHits()
-
Asserts a search response contains specific ids
assertMatchCount()
-
Asserts a matching count from a percolation response
assertFirstHit()
-
Asserts the first hit hits the specified matcher
assertSecondHit()
-
Asserts the second hit hits the specified matcher
assertThirdHit()
-
Asserts the third hit hits the specified matcher
assertSearchHit()
-
Asserts a certain element in a search response hits the specified matcher
assertNoFailures()
-
Asserts that no shard failures have occurred in the response
assertFailures()
-
Asserts that shard failures have happened during a search request
assertHighlight()
-
Asserts specific highlights matched
assertSuggestion()
-
Asserts for specific suggestions
assertSuggestionSize()
-
Asserts for a specific suggestion count
assertThrows()
-
Asserts a specific exception has been thrown
Common matchers
hasId()
-
Matcher to check for a search hit id
hasType()
-
Matcher to check for a search hit type
hasIndex()
-
Matcher to check for a search hit index
hasScore()
-
Matcher to check for a certain score of a hit
hasStatus()
-
Matcher to check for a certain RestStatus
Usually, you would combine assertions and matchers in your test like this:
SearchResponse searchResponse = client().prepareSearch() ...;
assertHitCount(searchResponse, 4);
assertFirstHit(searchResponse, hasId("4"));
assertSearchHits(searchResponse, "1", "2", "3", "4");
Glossary of terms
- analysis
-
Analysis is the process of converting full text to terms. Depending on which analyzer is used, these phrases: FOO BAR, Foo-Bar, foo,bar will probably all result in the terms foo and bar. These terms are what is actually stored in the index. A full text query (not a term query) for FoO:bAR will also be analyzed to the terms foo, bar and will thus match the terms stored in the index. It is this process of analysis (both at index time and at search time) that allows elasticsearch to perform full text queries. Also see text and term.
- cluster
-
A cluster consists of one or more nodes which share the same cluster name. Each cluster has a single master node which is chosen automatically by the cluster and which can be replaced if the current master node fails.
- document
-
A document is a JSON document which is stored in elasticsearch. It is like a row in a table in a relational database. Each document is stored in an index and has a type and an id. A document is a JSON object (also known in other languages as a hash / hashmap / associative array) which contains zero or more fields, or key-value pairs. The original JSON document that is indexed will be stored in the _source field, which is returned by default when getting or searching for a document.
- id
-
The ID of a document identifies a document. The index/type/id of a document must be unique. If no ID is provided, then it will be auto-generated. (also see routing)
- field
-
A document contains a list of fields, or key-value pairs. The value can be a simple (scalar) value (eg a string, integer, date), or a nested structure like an array or an object. A field is similar to a column in a table in a relational database. The mapping for each field has a field type (not to be confused with document type) which indicates the type of data that can be stored in that field, eg integer, string, object. The mapping also allows you to define (amongst other things) how the value for a field should be analyzed.
- index
-
An index is like a database in a relational database. It has a mapping which defines multiple types. An index is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.
- mapping
-
A mapping is like a schema definition in a relational database. Each index has a mapping, which defines each type within the index, plus a number of index-wide settings.
A mapping can either be defined explicitly, or it will be generated automatically when a document is indexed.
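An explicitly defined mapping is just a JSON object. The sketch below is illustrative only: the tweet type, its field names, and the analyzer choice are invented for this example rather than taken from any real index.

```python
import json

# Hypothetical mapping for a "tweet" type: each field declares a field
# type, and the "message" field additionally specifies an analyzer.
# (Field names and values here are invented for illustration.)
mapping = {
    "tweet": {
        "properties": {
            "user_id": {"type": "integer"},
            "posted":  {"type": "date"},
            "message": {"type": "string", "analyzer": "standard"},
        }
    }
}

print(json.dumps(mapping, indent=2))
```

If no mapping is supplied, Elasticsearch generates one like this automatically from the first document it sees for each field.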
- node
-
A node is a running instance of Elasticsearch which belongs to a cluster. Multiple nodes can be started on a single server for testing purposes, but usually you should have one node per server.
At startup, a node will use unicast to discover an existing cluster with the same cluster name and will try to join that cluster.
- primary shard
-
Each document is stored in a single primary shard. When you index a document, it is indexed first on the primary shard, then on all replicas of the primary shard.
By default, an index has 5 primary shards. You can specify fewer or more primary shards to scale the number of documents that your index can handle.
You cannot change the number of primary shards in an index once the index is created.
See also routing.
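The shard counts above are set in the index settings at creation time. A minimal sketch of such a settings body, and of the total number of shard copies it implies, assuming the standard number_of_shards and number_of_replicas settings:

```python
import json

# Hypothetical settings body for creating an index.
# number_of_shards (primary shards) is fixed at creation time;
# number_of_replicas can be changed later on a live index.
settings = {
    "settings": {
        "number_of_shards": 3,     # cannot be changed after creation
        "number_of_replicas": 1,   # can be updated dynamically
    }
}

# Total shard copies the cluster will try to allocate:
primaries = settings["settings"]["number_of_shards"]
replicas = settings["settings"]["number_of_replicas"]
total_shards = primaries * (1 + replicas)

print(json.dumps(settings))
print(total_shards)  # 3 primaries + 3 replicas = 6
```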
- replica shard
-
Each primary shard can have zero or more replicas. A replica is a copy of the primary shard, and has two purposes:
-
increase failover: a replica shard can be promoted to a primary shard if the primary fails
-
increase performance: get and search requests can be handled by primary or replica shards.
By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index. A replica shard will never be started on the same node as its primary shard.
-
- routing
-
When you index a document, it is stored on a single primary shard. That shard is chosen by hashing the routing value. By default, the routing value is derived from the ID of the document or, if the document has a specified parent document, from the ID of the parent document (to ensure that child and parent documents are stored on the same shard).
This value can be overridden by specifying a routing value at index time, or a routing field in the mapping.
- shard
-
A shard is a single Lucene instance. It is a low-level “worker” unit which is managed automatically by Elasticsearch. An index is a logical namespace which points to primary and replica shards.
Other than defining the number of primary and replica shards that an index should have, you never need to refer to shards directly. Instead, your code should deal only with an index.
Elasticsearch distributes shards amongst all nodes in the cluster, and can move shards automatically from one node to another in the case of node failure, or the addition of new nodes.
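The way a routing value selects a primary shard can be sketched as hash-then-modulo. This is only an illustration: real Elasticsearch uses a Murmur3-based hash rather than MD5, but the principle is the same, and it shows why the primary shard count cannot change after index creation — a different modulus would move existing documents.

```python
import hashlib

def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick the primary shard for a document from its routing value.

    Sketch only: Elasticsearch's actual hash function differs (Murmur3),
    but the scheme -- hash(routing) modulo the number of primary
    shards -- is the same idea.
    """
    h = int.from_bytes(hashlib.md5(routing.encode()).digest()[:4], "big")
    return h % num_primary_shards

# By default the routing value is the document ID.
print(shard_for("doc-1", 5))

# The same routing value always maps to the same shard:
assert shard_for("doc-1", 5) == shard_for("doc-1", 5)
```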
- source field
-
By default, the JSON document that you index will be stored in the _source field and will be returned by all get and search requests. This allows you access to the original object directly from search results, rather than requiring a second step to retrieve the object from an ID.
Note: the exact JSON string that you indexed will be returned to you, even if it contains invalid JSON. The contents of this field do not indicate anything about how the data in the object has been indexed.
- term
-
A term is an exact value that is indexed in Elasticsearch. The terms foo, Foo, FOO are NOT equivalent. Terms (i.e. exact values) can be searched for using term queries.
See also text and analysis.
- text
-
Text (or full text) is ordinary unstructured text, such as this paragraph. By default, text will be analyzed into terms, which is what is actually stored in the index.
Text fields need to be analyzed at index time in order to be searchable as full text, and keywords in full text queries must be analyzed at search time to produce (and search for) the same terms that were generated at index time.
See also term and analysis.
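The analysis process described in the analysis, term and text entries can be sketched with a toy analyzer. This is only an approximation of a standard-style analyzer (real analyzers also apply token filters, stop words, stemming, and so on), but it shows why FOO BAR, Foo-Bar and foo,bar all index the same terms, and why a full text query for FoO:bAR then matches them.

```python
import re

def analyze(text: str) -> list:
    """Toy analyzer: split on non-alphanumeric characters, lowercase.

    Sketch only -- real Elasticsearch analyzers are configurable chains
    of tokenizers and token filters.
    """
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

# Index time: all of these produce the same terms, ["foo", "bar"].
for s in ["FOO BAR", "Foo-Bar", "foo,bar"]:
    assert analyze(s) == ["foo", "bar"]

# Search time: the query text is analyzed the same way, so it matches.
print(analyze("FoO:bAR"))  # ['foo', 'bar']
```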
- type
-
A type is like a table in a relational database. Each type has a list of fields that can be specified for documents of that type. The mapping defines how each field in the document is analyzed.
Release Notes
This section summarizes the changes in each release.
149. 2.3.0 Release Notes
Breaking changes
- Allocation
-
-
Speed up shard balancer by reusing shard model while moving shards that can no longer be allocated to a node #16926
-
- Mapping
Deprecations
- Geo
- Plugin Discovery Multicast
- Query DSL
New features
Enhancements
- Allocation
- Cache
-
-
Make TermsQuery considered costly. #16851
-
- Cluster
- Core
- Exceptions
- Geo
-
-
Upgrade GeoPointField to use Lucene 5.5 PrefixEncoding #16482
-
- Internal
- Java API
- Logging
- Mapping
-
-
Add deprecation logging for mapping transform #16952 (issue: #16910)
-
Expose the reason why a mapping merge is issued. #16059 (issue: #15989)
-
Add sub-fields support to bool fields. #15636 (issue: #6587)
-
Improve cross-type dynamic mapping updates. #15633 (issue: #15568)
-
Make mapping updates more robust. #15539
-
- Network
- Packaging
- Plugin Cloud Azure
- Plugin Discovery EC2
- Plugin Mapper Attachment
- REST
-
-
More robust handling of CORS HTTP Access Control #16436
-
- Reindex API
- Scripting
-
-
Logs old script params use to the Deprecation Logger #16950 (issue: #16910)
-
Remove suppressAccessChecks permission for Groovy script plugin #16839 (issue: #16527)
-
Class permission for Groovy references #16660 (issue: #16657)
-
Scripting: Allow to get size of array in mustache #16193
-
Added plumbing for compile time script parameters #16163 (issue: #15464)
-
Enhancements to the mustache script engine #15661
-
- Search
- Settings
-
-
Log warning if max file descriptors too low #16506
-
Bug fixes
- Aggregations
-
-
Setting other bucket on empty aggregation #17264 (issue: #16546)
-
Build empty extended stats aggregation if no docs collected for bucket #16972 (issues: #16812, #9544)
-
Set meta data for pipeline aggregations #16516 (issue: #16484)
-
Filter(s) aggregation should create weights only once. #15998
-
Make missing on terms aggs work with all execution modes. #15746 (issue: #14882)
-
Fix NPE in Derivative Pipeline when current bucket value is null #14745
-
- Aliases
- Allocation
- Analysis
- Bulk
- CAT API
- CRUD
- Cache
-
-
Handle closed readers in ShardCoreKeyMap #16027
-
- Cluster
- Core
- Fielddata
- Geo
- Highlighting
- Inner Hits
- Internal
- Java API
- Logging
- Mapping
- Network
- Packaging
- Parent/Child
- Percolator
- Plugin Cloud Azure
-
-
Fix calling ensureOpen() on the wrong directory #16383
-
- Plugin Discovery GCE
- Query DSL
-
-
Fix FunctionScore equals/hashCode to include minScore and friends #15676
-
- REST
- Recovery
- Reindex API
- Scripting
-
-
Check that _value is used in aggregations script before setting value to specialValue #17091 (issue: #14262)
-
Add permission to access sun.reflect.MethodAccessorImpl from Groovy scripts #16540 (issue: #16536)
-
Fixes json generation for scriptsort w/ deprecated params #16261 (issue: #16260)
-
Security permissions for Groovy closures #16196 (issues: #16194, #248)
-
- Search
- Settings
-
-
TransportClient should use updated setting for initialization of modules and service #16095
-
- Snapshot/Restore
- Stats
- Task Manager
- Translog
-
-
Call ensureOpen on Translog#newView() to prevent IllegalStateException #17191
-
Make sure IndexShard is active during recovery so it gets its fair share of the indexing buffer #16209 (issue: #16206)
-
Avoid circular reference in exception #15952 (issue: #15941)
-
Initialize translog before scheduling the sync to disk #15881
-
Catch tragic event inside the checkpoint method rather than on the caller side #15825
-
Never delete translog-N.tlog file when creation fails #15788
-
Close recovered translog readers if createWriter fails #15762 (issue: #15754)
-
- Tribe Node
Regressions
- Analysis
- Plugin Cloud Azure
- REST
Upgrades
- Core
- Plugin Cloud Azure
- Plugin Discovery Azure
- Scripting
151. 2.2.1 Release Notes
Enhancements
Bug fixes
- Aggregations
- Aliases
- Bulk
- Inner Hits
- Logging
- Parent/Child
- Percolator
- Plugin Cloud Azure
-
-
Fix calling ensureOpen() on the wrong directory #16383
-
- Plugin Discovery GCE
- Query DSL
- REST
-
-
Remove detect_noop from REST spec #16386
-
- Scripting
- Snapshot/Restore
- Stats
- Tribe Node
-
-
Passthrough environment and network settings to tribe client nodes #16893
-
Regressions
- Plugin Cloud Azure
Upgrades
152. 2.2.0 Release Notes
Breaking changes
- Index APIs
- Scripting
Deprecations
- Java API
- Plugin Discovery Multicast
- Query DSL
- Search
New features
Enhancements
- Aliases
- Allocation
- Analysis
- CAT API
- Cluster
-
-
Safe cluster state task notifications #15777
-
Reroute once per batch of shard failures #15510
-
Add callback for publication of new cluster state #15494 (issue: #15482)
-
Split cluster state update tasks into roles #15159
-
Use general cluster state batching mechanism for shard started #15023 (issues: #14725, #14899)
-
Use general cluster state batching mechanism for shard failures #15016 (issues: #14725, #14899)
-
Set a newly created IndexShard’s ShardRouting before exposing it to operations #14918 (issue: #10708)
-
Uniform exceptions for TransportMasterNodeAction #14737
-
- Core
-
-
If we can’t get a MAC address for the node, use a dummy one #15266 (issue: #10099)
-
Simplify IndexingMemoryController#checkIdle #15252 (issue: #15251)
-
IndexingMemoryController should not track shard index states #15251 (issues: #13918, #15225)
-
Make PerThreadIDAndVersionLookup per-segment #14070
-
Verify Checksum once it has been fully written to fail as soon as possible #13896
-
- Discovery
- Exceptions
- Fielddata
-
-
Update GeoPoint FieldData for GeoPointV2 #14345
-
- Geo
- Index APIs
- Index Templates
- Internal
-
-
Simplify the Text API. #15511
-
Simpler using compressed oops flag representation #15509 (issue: #15489)
-
Info on compressed ordinary object pointers #15489 (issues: #13187, #455)
-
Explicitly log cluster state update failures #15428 (issues: #14899, #15016, #15023)
-
Use transport service to handle RetryOnReplicaException to execute replica action on the current node #15363
-
Make IndexShard operation be more explicit about whether they are expected to run on a primary or replica #15282
-
Avoid trace logging allocations in TransportBroadcastByNodeAction #15221
-
Only trace log shard not available exceptions #14950 (issue: #14927)
-
Transport options should be immutable #14760
-
Fix dangling comma in ClusterBlock#toString #14483
-
Improve some logging around master election and cluster state #14481
-
Add System#exit(), Runtime#exit() and Runtime#halt() to forbidden APIs #14473 (issue: #12596)
-
Simplify XContent detection. #14472
-
Add threadgroup isolation. #14353
-
Cleanup plugin security #14311
-
Add workaround for JDK-8014008 #14274
-
Refactor retry logic for TransportMasterNodeAction #14222
-
Remove MetaDataService and its semaphores #14159 (issue: #1296)
-
Cleanup IndexMetaData #14119
-
TransportNodesAction shouldn’t hold on to cluster state #13948
-
Add SpecialPermission to guard exceptions to security policy. #13854
-
Clean up scripting permissions. #13844
-
Factor groovy out of core into lang-groovy #13834 (issue: #13725)
-
Factor expressions scripts out to lang-expression plugin #13726 (issue: #13725)
-
- Java API
-
-
TransportClient: Add exception when using plugin.types, to help migration to addPlugin #15943 (issue: #15693)
-
Align handling of interrupts in BulkProcessor #15527 (issue: #14833)
-
BulkProcessor backs off exponentially by default #15513 (issue: #14829)
-
Allow to get and set ttl as a time value/string #15239 (issue: #15047)
-
Reject refresh usage in bulk items when using and fix NPE when no source #15082 (issue: #7361)
-
BulkProcessor retries after request handling has been rejected due to a full thread pool #14829 (issue: #14620)
-
- Logging
- Mapping
- Network
- Packaging
-
-
Default standard output to the journal in systemd #16159 (issues: #15315, #16134)
-
Use egrep instead of grep -E for Solaris #15755 (issue: #15628)
-
punch thru symlinks when loading plugins/modules #15311
-
set ActiveProcessLimit=1 on windows #15055
-
set RLIMIT_NPROC = 0 on bsd/os X systems. #15039
-
Drop ability to execute on Solaris #14200
-
Packaging: change permissions/ownership of config dir #14017 (issue: #11016)
-
Release: Fix package repo path to only consist of major version #13971 (issue: #12493)
-
Nuke ES_CLASSPATH appending, JarHell fail on empty classpath elements #13880 (issues: #13812, #13864)
-
Get lang-javascript, lang-python, securemock ready for script refactoring #13695
-
Remove some bogus permissions only needed for tests. #13620
-
Remove java.lang.reflect.ReflectPermission "suppressAccessChecks" #13603
-
- Plugin Cloud AWS
- Plugin Cloud Azure
- Plugin Cloud GCE
- Plugin Discovery EC2
- Plugin Mapper Attachment
-
-
Backport mapper-attachments plugin to 2.x #14902
-
- Plugin Repository S3
- Plugins
-
-
Expose http.type setting, and collapse al(most all) modules relating to transport/http #15434 (issue: #14148)
-
Ban RuntimePermission("getClassLoader") #15253
-
Add nicer error message when a plugin descriptor is missing #15200 (issue: #15197)
-
Don’t be lenient in PluginService#processModule(Module) #14306
-
Check "plugin already installed" before jar hell check. #14207 (issue: #14205)
-
Plugin script to set proper plugin bin dir attributes #14088 (issue: #11016)
-
Plugin script to set proper plugin config dir attributes #14048 (issue: #11016)
-
- Query DSL
- REST
- Recovery
- Scripting
- Search
- Snapshot/Restore
- Stats
- Top Hits
- Translog
-
-
Check for tragic event on all kinds of exceptions not only ACE and IOException #15535
-
- Tribe Node
Bug fixes
- Aggregations
-
-
Filter(s) aggregation should create weights only once. #15998
-
Make missing on terms aggs work with all execution modes. #15746 (issue: #14882)
-
Run pipeline aggregations for empty buckets added in the Range Aggregation #15519 (issue: #15471)
-
[Children agg] fix bug that prevented all child docs from being evaluated #15457
-
Correct typo in class name of StatsAggregator #15321 (issue: #14730)
-
Fix significant terms reduce for long terms #14948 (issue: #13522)
-
Pass extended bounds into HistogramAggregator when creating an unmapped aggregator #14742 (issue: #14735)
-
Added correct generic type parameter on ScriptedMetricBuilder #14018 (issue: #13986)
-
- Aliases
- Allocation
-
-
Prevent peer recovery from node with older version #15775
-
Fix calculation of next delay for delayed shard allocation #14765
-
Take ignored unallocated shards into account when making allocation decision #14678 (issue: #14670)
-
Only allow rebalance operations to run if all shard store data is available #14591 (issue: #14387)
-
Delayed allocation can miss a reroute #14494 (issues: #14010, #14011, #14445)
-
Check rebalancing constraints when shards are moved from a node they can no longer remain on #14259 (issue: #14057)
-
- Bulk
- CAT API
-
-
Properly set indices and indicesOptions on subrequest made by /_cat/indices #14360
-
- CRUD
- Cluster
- Core
-
-
BitSetFilterCache duplicates its content. #15836 (issue: #15820)
-
Limit the max size of bulk and index thread pools to bounded number of processors #15585 (issue: #15582)
-
AllTermQuery’s scorer should skip segments that never saw the requested term #15506
-
Include root-cause exception when we fail to change shard’s index buffer #14867
-
Restore thread interrupt flag after an InterruptedException #14799 (issue: #14798)
-
Use fresh index settings instead of relying on @IndexSettings #14578 (issue: #14319)
-
Record all bytes of the checksum in VerifyingIndexOutput #13923 (issues: #13848, #13896)
-
When shard becomes active again, immediately increase its indexing buffer #13918 (issue: #13802)
-
- Engine
-
-
Never wrap searcher for internal engine operations #14071
-
- Exceptions
- Fielddata
-
-
Don’t cache top level field data for fields that don’t exist #14693
-
- Geo
- Highlighting
- Index APIs
-
-
Field stats: Index constraints should remove indices in the response if the field to evaluate is empty #14868
-
Field stats: Fix NPE for index constraint on empty index #14841
-
Field stats: Added format option for index constraints #14823 (issue: #14804)
-
Restore previous optimize transport action name for bw comp #14221 (issue: #13778)
-
- Internal
-
-
Log uncaught exceptions from scheduled once tasks #15824 (issue: #15814)
-
Make sure the remaining delay of unassigned shard is updated with every reroute #14890 (issue: #14808)
-
Throw a meaningful error when loading metadata and an alias and index have the same name #14842 (issue: #14706)
-
fixup issues with 32-bit jvm #14609
-
Failure to update the cluster state with the recovered state should make sure it will be recovered later #14485
-
Properly bind ClassSet extensions as singletons #14232 (issue: #14194)
-
Streamline top level reader close listeners and forbid general usage #14084
-
Gateway: a race condition can prevent the initial cluster state from being recovered #13997
-
Verify actually written checksum in VerifyingIndexOutput #13848
-
Remove all setAccessible in tests and forbid #13539
-
Remove easy uses of setAccessible in tests. #13537
-
Ban setAccessible from core code, restore monitoring stats under java 9 #13531 (issue: #13527)
-
- Logging
- Mapping
-
-
Fix initial sizing of BytesStreamOutput. #15864 (issue: #15789)
-
MetaDataMappingService should call MapperService.merge with the original mapping update. #15508
-
Fix copy_to when the target is a dynamic object field. #15385 (issue: #11237)
-
Only text fields should accept analyzer and term vector settings. #15308
-
Mapper parsers should not check for a tokenized property. #15289
-
Validate that fields are defined only once. #15243 (issue: #15057)
-
Check mapping compatibility up-front. #15175 (issue: #15049)
-
Don’t treat default as a regular type. #15156 (issue: #15049)
-
Treat mappings at an index-level feature. #15142
-
Multi field names may not contain dots #15118 (issue: #14957)
-
Mapping: Allows upgrade of indexes with only search_analyzer specified #14677 (issue: #14383)
- Packaging
- Plugin Cloud AWS
- Plugin Delete By Query
- Plugin Mapper Attachment
-
-
Fix toXContent() for mapper attachments field #15110
-
- Plugin Repository S3
- Plugins
- Query DSL
- REST
-
-
Throw exception when trying to write map with null keys #15479 (issue: #14346)
-
XContentFactory.xContentType: allow for possible UTF-8 BOM for JSON XContentType #14611 (issue: #14442)
-
Restore support for escaped / as part of document id #14216 (issues: #13665, #13691, #14177)
-
Add missing REST spec for detect_noop #14004
-
Expose nodes operation timeout in REST API #13981
-
Ensure XContent is consistent across platforms #13816
-
- Recovery
- Scripting
- Search
- Settings
-
-
TransportClient should use updated setting for initialization of modules and service #16095
-
- Shadow Replicas
- Snapshot/Restore
- Stats
- Translog
-
-
Make sure IndexShard is active during recovery so it gets its fair share of the indexing buffer #16209 (issue: #16206)
-
Avoid circular reference in exception #15952 (issue: #15941)
-
Initialize translog before scheduling the sync to disk #15881
-
Translog base flushes can be disabled after replication relocation or slow recovery #15830 (issues: #10624, #15814)
-
Catch tragic event inside the checkpoint method rather than on the caller side #15825
-
Never delete translog-N.tlog file when creation fails #15788
-
Close recovered translog readers if createWriter fails #15762 (issue: #15754)
-
Fail and close translog hard if writing to disk fails #15420 (issue: #15333)
-
Prevent writing to closed channel if translog is already closed #15012 (issue: #14866)
-
Don’t delete temp recovered checkpoint file if it was renamed #14872 (issue: #14695)
-
Translog recovery can repeatedly fail if we run out of disk #14695
-
- Tribe Node
Regressions
- Analysis
- Internal
- Plugin Cloud Azure
- Query DSL
Upgrades
- Core
-
-
Upgrade to lucene-5.4.1. #16160
-
Upgrade to lucene-5.4.0. #15477
-
Upgrade Lucene to 5.4.0-snapshot-1715952 #14951
-
Upgrade Lucene to 5.4.0-snapshot-1714615 #14784
-
Upgrade to lucene-5.4.0-snapshot-1712973. #14619
-
update to lucene-5.4.x-snapshot-1711508 #14398
-
Upgrade to lucene-5.4-snapshot-1710880. #14320
-
Upgrade to lucene-5.4-snapshot-1708254. #14074
-
upgrade lucene to r1702265 #13439
-
Upgrade master to lucene 5.4-snapshot r1701068 #13324
-
- Geo
- Network
-
-
Upgrade Netty to 3.10.5.final #14105
-
- Plugin Discovery Azure
- Plugin Discovery EC2
-
-
Upgrade to aws 1.10.33 #14672
-
- Plugin Lang JS
-
-
upgrade rhino for plugins/lang-javascript #14466
-
153. 2.1.2 Release Notes
Enhancements
- Internal
- Plugin Cloud Azure
- Translog
-
-
Check for tragic event on all kinds of exceptions not only ACE and IOException #15535
-
Bug fixes
- Aggregations
- Aliases
- Allocation
-
-
Prevent peer recovery from node with older version #15775
-
- Cluster
- Core
- Highlighting
- Internal
- Mapping
- Packaging
- Query DSL
-
-
Fix FunctionScore equals/hashCode to include minScore and friends #15676
-
- Recovery
-
-
sync translog to disk after recovery from primary #15832
-
- Stats
- Translog
-
-
Make sure IndexShard is active during recovery so it gets its fair share of the indexing buffer #16209 (issue: #16206)
-
Avoid circular reference in exception #15952 (issue: #15941)
-
Initialize translog before scheduling the sync to disk #15881
-
Translog base flushes can be disabled after replication relocation or slow recovery #15830 (issues: #10624, #15814)
-
Catch tragic event inside the checkpoint method rather than on the caller side #15825
-
Never delete translog-N.tlog file when creation fails #15788
-
Close recovered translog readers if createWriter fails #15762 (issue: #15754)
-
- Tribe Node
Regressions
154. 2.1.1 Release Notes
Enhancements
- Aggregations
-
-
[Children agg] fix bug that prevented all child docs from being evaluated #15457
-
- Core
- Index Templates
- Mapping
Bug fixes
- Index APIs
-
-
Field stats: Index constraints should remove indices in the response if the field to evaluate is empty #14868
-
- Internal
- Mapping
- Search
- Translog
- Tribe Node
Regressions
155. 2.1.0 Release Notes
Also see Breaking changes in 2.1 for important changes in this release.
Breaking changes
Deprecations
- Java API
- Parent/Child
-
-
Deprecate score_type option in favour of the score_mode option #13478
-
- Query DSL
- Search
New features
- Aggregations
- Analysis
-
-
Lithuanian analysis #13244
-
- Geo
Enhancements
- Allocation
- CAT API
- Core
-
-
Verify Checksum once it has been fully written to fail as soon as possible #13896
-
- Exceptions
-
-
Deduplicate cause if already contained in shard failures #14432
-
Give a better exception when running from freebsd jail without enforce_statfs=1 #14135 (issue: #12018)
-
Make root_cause of field conflicts more obvious #13976 (issue: #12839)
-
Use a dedicated id to serialize EsExceptions instead of it’s class name. #13629
-
Improve error message of ClassCastExceptions #12821 (issue: #12135)
-
- Geo
- Index APIs
- Index Templates
- Internal
-
-
Fix dangling comma in ClusterBlock#toString #14483
-
Improve some logging around master election and cluster state #14481
-
Add workaround for JDK-8014008 #14274
-
Cleanup IndexMetaData #14119
-
More helpful error message on parameter order #13737
-
Cleanup InternalClusterInfoService #13543
-
Remove and forbid use of com.google.common.base.Throwables #13409 (issue: #13224)
-
Remove cyclic dependencies between IndexService and FieldData / BitSet caches #13381
-
Remove and forbid use of com.google.common.base.Objects #13355 (issue: #13224)
-
Remove and forbid use of com.google.common.collect.ImmutableList #13227 (issue: #13224)
-
Remove and forbid use of com.google.common.collect.Lists #13170
-
Remove unused code from query_string parser and settings #13098
-
Consolidate duplicate logic in RoutingTable all*ShardsGrouped #13082 (issue: #13081)
-
Turn DestructiveOperations.java into a Guice module. #13046 (issue: #4665)
-
Enable indy (invokedynamic) compile flag for Groovy scripts by default #8201 (issue: #8184)
-
- Java API
-
-
Prevents users from building a BulkProcessor with a null client #12497
-
- Logging
- Packaging
-
-
Drop ability to execute on Solaris #14200
-
Nuke ES_CLASSPATH appending, JarHell fail on empty classpath elements #13880 (issues: #13812, #13864)
-
improve seccomp syscall filtering #13829
-
Block process execution with seccomp on linux/amd64 #13753
-
Remove JAVA_HOME detection from the debian init script #13514 (issues: #13403, #9774)
-
- Plugin Cloud AWS
- Plugin Cloud GCE
- Plugin Discovery EC2
- Plugin Repository S3
- Plugins
-
-
Don’t be lenient in PluginService#processModule(Module) #14306
-
Adds a validation for plugins script to check if java is set #13633 (issue: #13613)
-
Plugins: Removed plugin.types #13055
-
Improve java version comparison and explicitly enforce a version format #13010 (issues: #12441, #13009)
-
Output plugin info only in verbose mode #12908 (issue: #12907)
-
- Query DSL
-
-
Internal: simplify filtered query conversion to lucene query #13312 (issue: #13272)
-
Remove unsupported rewrite from multi_match query builder #13073 (issue: #13069)
-
Remove unsupported rewrite option from match query builder #13069
-
Make FunctionScore work on unmapped field with missing parameter #13060 (issue: #10948)
-
- Scripting
- Scroll
-
-
Optimize sorted scroll when sorting by _doc. #12983
-
- Search
- Search Templates
- Snapshot/Restore
- Stats
Bug fixes
- Aggregations
-
-
Pass extended bounds into HistogramAggregator when creating an unmapped aggregator #14742 (issue: #14735)
-
Added correct generic type parameter on ScriptedMetricBuilder #14018 (issue: #13986)
-
Pipeline Aggregations at the root of the agg tree are now validated #13475 (issue: #13179)
-
Estimate HyperLogLog bias via k-NN regression #13243
-
- Allocation
-
-
Fix calculation of next delay for delayed shard allocation #14765
-
Take ignored unallocated shards into account when making allocation decision #14678 (issue: #14670)
-
Only allow rebalance operations to run if all shard store data is available #14591 (issue: #14387)
-
Delayed allocation can miss a reroute #14494 (issues: #14010, #14011, #14445)
-
Check rebalancing constraints when shards are moved from a node they can no longer remain on #14259 (issue: #14057)
-
- CAT API
-
-
Properly set indices and indicesOptions on subrequest made by /_cat/indices #14360
-
- CRUD
- Cluster
- Core
-
-
Use fresh index settings instead of relying on @IndexSettings #14578 (issue: #14319)
-
Fork Lucene PatternTokenizer to apply LUCENE-6814 (closes #13721) #14571 (issue: #13721)
-
Record all bytes of the checksum in VerifyingIndexOutput #13923 (issues: #13848, #13896)
-
When shard becomes active again, immediately increase its indexing buffer #13918 (issue: #13802)
-
LoggingRunnable.run should catch and log all errors, not just Exception? #13718 (issue: #13487)
-
- Exceptions
- Fielddata
-
-
Don’t cache top level field data for fields that don’t exist #14693
-
- Geo
- Index APIs
- Index Templates
- Internal
-
-
fix mvn verify on jigsaw with 2.1 #14750
-
fixup issues with 32-bit jvm #14609
-
Failure to update the cluster state with the recovered state should make sure it will be recovered later #14485
-
Gateway: a race condition can prevent the initial cluster state from being recovered #13997
-
Verify actually written checksum in VerifyingIndexOutput #13848
-
An inactive shard is activated by triggered synced flush #13802
-
- Logging
- Mapping
- Packaging
- Parent/Child
-
-
Remove unnecessary usage of extra index searchers #12864
-
- Plugin Delete By Query
- Plugins
- REST
- Search
-
-
Fix the quotes in the explain message for a script score function without parameters #11398
-
- Settings
-
-
ByteSizeValue.equals should normalize units #13784
-
- Snapshot/Restore
- Stats
- Translog
Regressions
Upgrades
- Core
- Geo
- Internal
- Plugin Cloud AWS
- Plugin Discovery EC2
-
-
Upgrade to aws 1.10.33 #14672
-
156. 2.0.2 Release Notes
Bug fixes
- Aggregations
-
-
[Children agg] fix bug that prevented all child docs from being evaluated #15457
-
- Index APIs
- Internal
- Mapping
- Translog
157. 2.0.1 Release Notes
Bug fixes
- Aggregations
- Allocation
-
-
Fix calculation of next delay for delayed shard allocation #14765
-
Take ignored unallocated shards into account when making allocation decision #14678 (issue: #14670)
-
Only allow rebalance operations to run if all shard store data is available #14591 (issue: #14387)
-
Delayed allocation can miss a reroute #14494 (issues: #14010, #14011, #14445)
-
- CAT API
-
-
Properly set indices and indicesOptions on subrequest made by /_cat/indices #14360
-
- Cluster
- Core
- Fielddata
-
-
Don’t cache top level field data for fields that don’t exist #14693
-
- Index APIs
- Mapping
- Plugin Delete By Query
- Plugins
- REST
- Scripting
- Stats
- Translog
-
-
Translog recovery can repeatedly fail if we run out of disk #14695
-
Regressions
158. 2.0.0 Release Notes
Breaking changes
- Packaging
Deprecations
- Mapping
Enhancements
Bug fixes
Upgrades
- Network
-
-
Upgrade Netty to 3.10.5.final #14105
-
159. 2.0.0-rc1 Release Notes
Enhancements
Bug fixes
- CRUD
- Core
- Exceptions
-
-
Prevent losing stacktraces when exceptions occur #13587
-
- Geo
- Index APIs
- Internal
-
-
An inactive shard is activated by triggered synced flush #13802
-
- Logging
- Packaging
- Plugins
- Settings
-
-
ByteSizeValue.equals should normalize units #13784
-
- Translog
- Tribe Node
-
-
Increment tribe node version on updates #13566
-
160. 2.0.0-beta2 Release Notes
Breaking changes
- CAT API
- Internal
- Mapping
- Settings
Deprecations
- Geo
-
-
Refactor geo_point validate_* and normalize_* for 1.7 #12300
-
- Query DSL
Enhancements
Bug fixes
- Aggregations
- Allocation
-
-
Take relocating shard into consideration during awareness allocation #13512 (issue: #12551)
-
Take Shard data path into account in DiskThresholdDecider #13195 (issue: #13106)
-
Expand ClusterInfo to provide min / max disk usage for allocation decider #13163 (issue: #13106)
-
Take initializing shards into consideration during awareness allocation #12551 (issue: #12522)
-
- Core
-
-
Engine: refresh before translog commit #13414 (issue: #13379)
-
Fix exception handling for unavailable shards in broadcast replication action #13341 (issue: #13068)
-
Call beforeIndexShardCreated listener earlier in createShard #13153
-
Detect duplicate settings keys on startup #13086 (issue: #13079)
-
Don’t check if directory is present to prevent races #13049
-
- Engine
- Inner Hits
- Internal
- Mapping
-
-
Split the _parent field mapping’s field type into two field types #13399 (issue: #13169)
-
Fix numerous checks for equality and compatibility in mapper field types #13206 (issues: #13112, #8871)
-
Fix doc parser to still pre/post process metadata fields on disabled type #13137 (issue: #13017)
-
Fix document parsing to properly ignore entire type when disabled #13085 (issue: #13017)
-
update_all_types missing from REST spec and tests [ISSUE] #12840
-
- Nested Docs
-
-
Nested query should only use bitset cache for parent filter #13087
-
- Network
- Packaging
- Plugins
- Query DSL
- Recovery
-
-
Failed to properly ack translog ops during wait on mapping changes #13535
-
- Scripting
- Search
- Settings
-
-
Fix discovery.zen.join_timeout default value logic #13162
-
- Shadow Replicas
- Snapshot/Restore
- Stats
-
-
Remove the network option from nodes info/stats [ISSUE] #12889
-
Upgrades
161. 2.0.0-beta1 Release Notes
Breaking changes
- Aggregations
-
-
Removed unused factor parameter in DateHistogramBuilder #12850 (issue: #6490)
-
Aggregation: Removed Old Script Java API from metrics aggregations #12236
-
Change the default min_doc_count to 0 on histograms. #10904
-
Speed up include/exclude in terms aggregations with regexps, using Lucene regular expressions #10418 (issues: #7526, #9848)
-
Clean up time zone options for date_histogram #9637 (issue: #9062)
-
Add offset to date_histogram, replacing pre_offset and post_offset #9597 (issue: #9062)
-
Facets: Removal from master. #7337
-
Changed the response structure of the percentiles aggregation #6079 (issue: #5870)
-
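To illustrate the offset change above (#9597): where pre_offset and post_offset were previously used, a single offset is now passed to date_histogram. A sketch of a request body, with placeholder aggregation and field names:

```json
{
  "aggs": {
    "events_per_day": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day",
        "offset": "+6h"
      }
    }
  }
}
```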
- Aliases
- Allocation
- Analysis
- Bulk
-
-
Remove Bulk UDP #7595
-
- CAT API
- CRUD
-
-
Remove core delete-by-query implementation #10859 (issue: #10067)
-
Remove async replication from the docs and REST spec #10162 (issue: #10114)
-
Delete api: remove broadcast delete if routing is missing when required #10136 (issue: #9123)
-
Version types EXTERNAL & EXTERNAL_GTE test for version equality in read operation & disallow them in the Update API #5929 (issues: #5661, #5663)
-
- Cache
- Circuit Breakers
- Cluster
- Core
-
-
Remove MergeScheduler pluggability #11585
-
Don’t allow indices containing too-old segments to be opened #11072 (issues: #10215, #11095)
-
Remove Restart API and remove Node#stop() #9921 (issue: #9841)
-
Remove component settings from AbstractComponent #9919
-
Refactor settings filtering, adding regex support #9748 (issue: #6295)
-
Cut over to Path API for file deletion #8366
-
Switch to murmurhash3 to route documents to shards. #7954
-
Resiliency: Throw exception if the JVM will corrupt data. #7580
-
- Discovery
- Engine
-
-
Remove full flush / FlushType.NEW_WRITER #9559
-
- Exceptions
- Fielddata
-
-
Fielddata: Remove soft/resident caches. #7443
-
- Highlighting
- Index APIs
- Index Templates
- Internal
-
-
Flatten SearchService and clean up built-in registration #12807 (issue: #12783)
-
Consolidate shard level abstractions #11847
-
Bake in TieredMergePolicy #11588
-
Remove Translog interface #10988
-
Remove InternalNode class and use Node directly #9844
-
Remove OperationRouting abstraction #9085
-
Drop support for state written pre 0.90 #8850
-
Remove some more bwc code #8778
-
Remove runtime version checks #8768
-
Remove NoneGateway, NoneGatewayAllocator, & NoneGatewayModule #8537
-
Simplify reading / writing from and to BlobContainer #7551
-
Refactor guice startup #7289
-
Fixed filters execution order and fix potential concurrency issue in filter chains #7023 (issues: #7019, #7021)
-
Make transport action name available in TransportAction base class #6860
-
Cleanup Rest Response #5612
-
Remove Releasable in favor of Closeable [ISSUE] #5423
-
- Java API
-
-
Java api: remove execution from TermsQueryBuilder as it has no effect #12884
-
Enhancement/terms lookup fixes #12870
-
Centralize admin implementations and action execution #10955
-
Automatically thread client based action listeners #10940
-
Remove redundant BytesQueryBuilder in favour of using WrapperQueryBuilder internally #10919
-
Aggregations: Clean up response API for Aggregations #9221
-
QueryBuilders cleanup (add and deprecate) #8667 (issue: #8721)
-
Remove operationThreaded setter from ExplainRequestBuilder #7186
-
Remove unnecessary intermediate interfaces #6517 (issue: #4355)
-
Remove operation threading from broadcast actions #6044
-
Remove search operation threading option #6042
-
Make Create/Update/Delete classes less mutable #5939 (issue: #5917)
-
- Logging
-
-
Truncate log messages at 10,000 characters by default #11050
-
- Mapping
-
-
Disallow type names to start with dots for new indices except for .percolator #12561 (issue: #12560)
-
Remove ability to configure _index #12356 (issues: #12027, #12329)
-
Enforce field names do not contain dot #12068
-
Restrict fields with the same name in different types to have the same core settings #11812 (issue: #8871)
-
Remove the compress/compress_threshold options of the BinaryFieldMapper #11280
-
Remove ability to set the value of meta fields inside _source #11074 (issue: #11051)
-
Validate dynamic mappings updates on the master node. #10634 (issues: #8650, #8688)
-
Remove the ability to have custom per-field postings and doc values formats. #9741 (issue: #8746)
-
Remove support for new indexes using path setting in object/nested fields or index_name in any field #9570 (issue: #6677)
-
Remove type prefix support from field names in queries #9492 (issue: #8872)
-
Remove index_analyzer setting to simplify analyzer logic #9451 (issue: #9371)
-
Remove type level default analyzers #9430 (issues: #8874, #9365)
-
Remove fieldSearchAnalyzer and fieldSearchQuoteAnalyzer from MapperService. #9262
-
Remove allow_type_wrapper setting #9185
-
Add doc values support to boolean fields. #7961 (issues: #4678, #7851)
-
Remove unsupported postings_format/doc_values_format #7604 (issues: #7238, #7566)
-
Mappings: Update mapping on master in async manner #6648
-
The binary field shouldn’t be stored by default, because it is already available in the _source #4957
-
- More Like This
- NOT CLASSIFIED
-
-
Rivers removal. #11568
-
- Network
- Packaging
- Parent/Child
- Plugins
-
-
Simplify Plugin API for constructing modules #12952
-
Use name from plugin descriptor file to determine plugin name #12775 (issue: #12715)
-
make java.version mandatory for jvm plugins #12424
-
Adapt pluginmanager to the new world #12408
-
CLITool: Port PluginManager to use CLITool #12290
-
One single (global) way to register custom query parsers #11481
-
Don’t overwrite plugin configuration when removing/upgrading plugins #7890 (issue: #5064)
-
- Query DSL
-
-
Don’t allow fuzziness specified as a % and require edits [0,2] instead #12229 (issue: #10638)
-
Remove filter parsers. #10985
-
Deprecate the limit filter. #10532
-
Remove fuzzy_like_this query #10391
-
Function Score: Refactor RandomScoreFunction to be consistent, and return values in range [0.0, 1.0] #7446 (issue: #6907)
-
Remove custom_score and custom_boost_factor queries #5076
-
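As a sketch of the fuzziness change above (#12229): a percentage such as "fuzziness": "50%" is no longer accepted, and an explicit edit distance in [0,2] (or AUTO) is required instead. The field name and query text below are placeholders:

```json
{
  "query": {
    "match": {
      "title": {
        "query": "quikc brwn fox",
        "fuzziness": 2
      }
    }
  }
}
```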
- REST
-
-
Cluster state: return routing_nodes only when requested through specific flag #10486 (issue: #10412)
-
Remove jsonp support and associated tests #9242 (issue: #9108)
-
Remove status code from main action / #8865
-
Add all meta fields to the top level json document in search response #8131
-
Security: Disable JSONP by default #6795
-
- Recovery
- Scripting
-
-
Remove deprecated script APIs #11619
-
Add script type and script name to error messages #11449 (issue: #6653)
-
Added a new script construct #10649
-
Remove deprecated methods from ScriptService #10476
-
Remove support for script.disable_dynamic setting #10286 (issue: #10116)
-
Cleanup ScriptService & friends in preparation for #6418 #9992 (issue: #6418)
-
Removed deprecated script parameter names #9815
-
Disable dynamic Groovy scripting by marking Groovy as not sandboxed [ISSUE] #9655
-
Created a parameter parser to standardise script options #7977
-
Script with _score: remove dependency of DocLookup and scorer #7819 (issues: #7043, #7487)
-
Remove MVEL as a built-in scripting language #6610
-
Switch to Groovy as the default scripting language #6571 (issue: #6233)
-
- Search
-
-
Cut over to the Lucene filter cache #10897
-
Remove terms filter lookup cache. #9056
-
Fix script fields to be returned as a multivalued field when they produce a list #8592 (issue: #3015)
-
Remove partial fields #8133
-
Only return aggregations on the first page with scroll and forbidden with scan #7497 (issue: #1642)
-
- Settings
-
-
Do not permit multiple settings files #13043 (issue: #13042)
-
change CORS allow origin default to allow no origins #11890 (issue: #11169)
-
Require units for time and byte-sized settings, take 2 #11437 (issues: #10888, #7616, #7633)
-
Remove mapping.date.round_ceil setting for date math parsing #8889 (issues: #8556, #8598)
-
Add an index.query.parse.allow_unmapped_fields setting that fails if queries refer to unmapped fields. #6928 (issue: #6664)
-
Change default filter cache to 10% and circuit breaker to 60% #5990
-
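A minimal sketch of the units requirement above (#11437): time and byte-sized settings must carry an explicit unit, so a bare number that was previously interpreted as milliseconds or bytes now fails to parse. The setting values shown are illustrative:

```json
{
  "index.refresh_interval": "5s",
  "index.translog.flush_threshold_size": "512mb"
}
```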
- Shadow Replicas
- Snapshot/Restore
-
-
Url repository should respect repo.path for file urls #11687
-
Fix FSRepository location configuration #11157 (issues: #10828, #11068)
-
Remove obsolete expand_wildcards_open and expand_wildcards_close options #10744 (issues: #10743, #6097)
-
Automatic verification of all files that are being snapshotted with Snapshot/Restore #7159 (issue: #5593)
-
- Stats
-
-
Remove network stats & info #12054
-
Update fs stats #12053
-
Update OS stats #12049
-
Update process stats #12043
-
Remove sigar completely #12010
-
Removed id_cache from stats and cat apis. #11183 (issue: #5269)
-
Cleanup JVM info and stats #10553
-
Add human readable start_time and refresh_interval #5544 (issue: #5280)
-
Migrating NodesInfo API to use plugins instead of singular plugin #5072
-
- Store
- Term Vectors
-
-
More consistent naming for term vector[s] #8484
-
Deprecations
- Fielddata
-
-
Remove non-default fielddata formats. #11669
-
- Geo
- NOT CLASSIFIED
- Settings
New features
- Aggregations
-
-
Add HDRHistogram as an option in percentiles and percentile_ranks aggregations #12362 (issue: #8324)
-
Aggregations: Adds other bucket to filters aggregation #11948 (issue: #11289)
-
Aggregations: Pipeline Aggregation to filter buckets based on a script #11941
-
Adds cumulative sum aggregation #11825
-
Allow users to perform simple arithmetic operations on histogram aggregations #11601 (issue: #11029)
-
Aggregations: add serial differencing pipeline aggregation #11196 (issue: #10190)
-
Add Holt-Winters to moving_avg aggregation #11043
-
Make it possible to configure missing values. #11042 (issue: #5324)
-
Pipeline aggregations: Ability to perform computations on aggregations #10568 (issues: #10000, #10002, #9293, #9876)
-
PercentageScore heuristic for significant_terms #9747 (issue: #9720)
Return the sum of the doc counts of other buckets in terms aggregations. #8213
-
Significant terms: add scriptable significance heuristic #7850
-
Added pre and post offset to histogram aggregation #6980 (issue: #6605)
-
Add children aggregation #6936
-
Significant Terms: Add google normalized distance and chi square #6858
-
Infrastructure for changing easily the significance terms heuristic #6561
-
Deferred aggregations prevent combinatorial explosion #6128
-
Support bounding box aggregation on geo_shape/geo_point data types. [ISSUE] #5634
-
Add reverse nested aggregation [ISSUE] #5485
-
Cardinality aggregation [ISSUE] #5426
-
Percentiles aggregation [ISSUE] #5323
-
Significant_terms aggregation #5146
-
Add preserve original token option to ASCIIFolding #5115 (issue: #4931)
-
Add script support to value_count aggregations. #5007 (issue: #5001)
-
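For the HDRHistogram option above (#12362), a percentiles request can opt in via the hdr object; the field name and significant-digit count below are placeholder choices:

```json
{
  "aggs": {
    "load_time_percentiles": {
      "percentiles": {
        "field": "load_time",
        "percents": [95, 99, 99.9],
        "hdr": {
          "number_of_significant_value_digits": 3
        }
      }
    }
  }
}
```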
- Allocation
- Analysis
- CAT API
- CRUD
- Cache
-
-
Query Cache: Support shard level query response caching #7161
-
- Circuit Breakers
- Core
- Dates
- Index APIs
- Index Templates
- Indexed Scripts/Templates
- Internal
-
-
Added an option to add arbitrary headers to the client requests #7127
-
- Java API
- Logging
- Mapping
- More Like This
- Percolator
-
-
Enable percolation of nested documents #5082
-
- Plugin Delete By Query
-
-
Add delete-by-query plugin #11516
-
- Plugins
- Query DSL
-
-
Query DSL: Add filter clauses to bool queries. #11142
-
Add span within/containing queries. #10913
-
Add format support for date range filter and queries #7821 (issue: #7189)
-
Add min_score parameter to function score query to only match docs above this threshold #7814 (issue: #6952)
-
Add the field_value_factor function to the function_score query #5519
-
Added cross_fields type to multi_match query #5005 (issue: #2959)
-
Allow for executing queries based on pre-defined templates [ISSUE] #4879
-
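The new filter clause on bool queries (#11142) runs a sub-query in a non-scoring context alongside the scoring clauses; a sketch with placeholder field names:

```json
{
  "query": {
    "bool": {
      "must": { "match": { "title": "search" } },
      "filter": { "range": { "price": { "lte": 100 } } }
    }
  }
}
```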
- REST
- Recovery
- Scripting
-
-
Add Multi-Valued Field Methods to Expressions #11105
-
Add support for fine-grained settings #10116 (issues: #10274, #6418)
-
Add script engine for Lucene expressions #6819 (issue: #6818)
-
Add Groovy as a scripting language, add groovy sandboxing #6233
-
Add Groovy as a scripting language, switching default from Mvel → Groovy #6106
-
- Search
-
-
Validate API: provide more verbose explanation #10147 (issues: #1412, #88247)
-
Add inner hits to nested and parent/child queries #8153 (issues: #3022, #3152)
-
Sorting: Allow _geo_distance to handle many to many geo point distance #7097 (issue: #3926)
-
Add search-exists API to check if any matching documents exist for a given query #7026 (issue: #6995)
-
Add an option to early terminate document collection when searching/counting #6885 (issue: #6876)
-
Sequential rescores [ISSUE] #4748
-
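Inner hits (#8153) let a nested or parent/child query return the matching inner documents alongside each top-level hit; a minimal sketch, with a placeholder nested path and field:

```json
{
  "query": {
    "nested": {
      "path": "comments",
      "query": { "match": { "comments.text": "elasticsearch" } },
      "inner_hits": {}
    }
  }
}
```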
- Search Templates
- Settings
- Shadow Replicas
- Stats
- Store
- Suggesters
- Term Vectors
- Top Hits
-
-
Add top_hits aggregation #6124
-
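The top_hits aggregation (#6124) surfaces the best-matching documents per bucket, for example the single top document in each terms bucket; the names below are placeholders:

```json
{
  "aggs": {
    "by_category": {
      "terms": { "field": "category" },
      "aggs": {
        "top_doc": {
          "top_hits": { "size": 1 }
        }
      }
    }
  }
}
```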
- Upgrade API
Enhancements
- Aggregations
-
-
Make ValueParser.DateMath aware of timezone setting #12886 (issue: #12278)
-
Fix setting timezone on default DateTime formatter #12581 (issue: #12531)
-
Aggregations: Add better validation of moving_avg model settings #12280
-
Aggregations: Adds a new GapPolicy - NONE #11951
-
Aggregations: Makes ValueFormat and ValueFormatter never null #11943 (issue: #10594)
-
Add cost minimizer to tune moving_avg parameters #11881
-
Aggregations: moving_avg model parser should accept any numeric #11778 (issue: #11487)
-
Renaming reducers to Pipeline Aggregators #11275
-
Improve include/exclude clause list speed and scalability #11188 (issue: #11176)
-
Remove pointless term frequency lookups. #11094 (issue: #11093)
-
Rename Moving Average models to their "common" names #10964
-
Derivative Aggregation x-axis units normalisation #10898
-
Added context for significant_terms scoring #10633 (issue: #10613)
-
Removed aggregations from ReduceContext #10509
-
Format bucket key_as_string in date_histogram according to time_zone #9744 (issue: #9710)
-
Refactor aggregations to use lucene5-style collectors. #9544 (issues: #6477, #9098)
-
Add offset option to histogram aggregation #9505 (issue: #9417)
-
Unify histogram implementations #9446
-
Internal simplifications. #9097
-
Numeric metric aggregations are now formattable #9032 (issue: #6812)
-
Adds methods to get to/from as Strings for Range Aggs #9026 (issue: #9003)
-
Made the nested, reverse_nested and children aggs ignore unmapped nested fields or unmapped child / parent types. #8808 (issue: #8760)
-
Do not sort histogram buckets on shards. #8797
-
Make size property parsing consistent #8645 (issue: #6061)
-
Do not take deleted documents into account in aggregations filters. #8540
-
Added getProperty method to Aggregations #8421
-
Meta data support with each aggregation request/response #8279 (issue: #6465)
-
Buckets can now be serialized outside of an Aggregation #8113 (issue: #8110)
-
Support for arrays of numeric values in include/exclude clauses #7727 (issue: #7714)
-
Add ability to sort on multiple criteria #7662 (issues: #6917, #7588)
-
Encapsulate AggregationBuilder name and make getter public #7425
-
Merge LongTermsAggregator and DoubleTermsAggregator. #7279
-
Remove the logic to optionally sort/dedup values on the fly. #7276
-
Make the list of buckets for terms and histogram returned as a java.util.List. #7275
-
Stops direct subclassing of InternalNumericMetricsAggregation #7058
-
Better heuristic for setting default shard_size in terms aggregation #6960 (issue: #6857)
-
Added an option to show the upper bound of the error for the terms aggregation #6778 (issue: #6696)
-
Extend allowed characters in aggregation name [ISSUE] #6702
-
Moved BucketsAggregator#docCounts field to IntArray #6529
-
GlobalOrdinalsStringTermsAggregator is inefficient for high-cardinality fields [ISSUE] #6518
-
Remove ordinals execution hint. #6499
-
Delegation of nextReader calls #6477
-
Add shard_min_doc_count parameter to terms aggregation #6143 (issues: #5998, #6041)
-
Add shard_min_doc_count parameter for significant terms similar to shard_size #6041
-
Add include/exclude support to global ordinals based terms and significant terms aggregations #6000
-
Lower the initial sizing of sub aggregations. #5994
-
Improve the way sub-aggregations are collected. #5975
-
Add global ordinal based implementations for significant terms aggregation #5970
-
Use collectExistingBucket() if a bucket already exists #5955
-
Significant_terms agg: added option for a backgroundFilter #5944
-
Improve terms aggregation to perform the segment ordinal to global ordinal lookup post segment collection #5895
-
Remove abstraction in the percentiles aggregation. #5859
-
Instantiate facets/aggregations during the QUERY phase. #5821
-
Aggregation cleanup #5699
-
Aggregations cleanup #5614
-
Refactor common code for unmapped aggregators into NonCollectingAggregator. #5528
-
Significant_terms agg only creates term frequency cache when necessary #5459 (issue: #5450)
-
Added extended_bounds support for date_/histogram aggs #5444 (issue: #5224)
-
Added support for sorting buckets based on sub aggregation down the current hierarchy #5340 (issue: #5253)
-
Terms aggs: only use ordinals on low-cardinality fields by default. #5304 (issue: #5303)
-
Rest API needs to be consistent across all multi-bucket aggs [ISSUE] #4926
-
- Aliases
- Allocation
-
-
Add expectedShardSize to ShardRouting and use it in path.data allocation #12947 (issue: #11271)
-
Make RoutingNodes read-only by default #12690
-
Avoid extra reroutes of delayed shards in RoutingService #12678 (issues: #12456, #12515, #12532)
-
Reroute shards when a node goes under disk watermarks #12452 (issue: #12422)
-
No need to find replica copy when index is created #12435
-
Adapt IndicesClusterStateService to use allocation ids #12397 (issues: #12242, #12387)
-
Simplify handling of ignored unassigned shards #12339
-
Initial Refactor Gateway Allocator #12335
-
Use recently added allocation ids for shard started/failed #12299 (issue: #12242)
-
Unique allocation id #12242
-
Allow shards to be allocated if leftover shard from different index exists #12237 (issue: #10677)
-
Simplify assignToNode to only do initializing #12235
-
Simplify ShardRouting and centralize move to unassigned #11634
-
When using recover_on_any_node on shared filesystem, respect Deciders #11168
-
Async fetch of shard started and store during allocation #11101 (issue: #9502)
-
Verify shards index UUID when fetching started shards #10200
-
Early terminate if the cluster can’t be rebalanced #9162
-
DiskThresholdDecider#remain(…) should take shards relocating away into account #8659 (issue: #8538)
-
Take percentage watermarks into account for reroute listener #8382 (issues: #8367, #8368)
-
Reroute shards automatically when high disk watermark is exceeded #8270 (issue: #8146)
-
Add rebalance enabled allocation decider #8190 (issue: #7288)
-
Add option to take currently relocating shards' sizes into account #7785 (issues: #6168, #7753)
-
Allow primaries that have never been allocated to be allocated if under the low watermark #6209 (issue: #6196)
-
Add explanations for all AllocationDeciders #4934 (issues: #2483, #4380)
-
Make shard balancing deterministic if weights are identical #4866
-
- Analysis
-
-
Document and test custom analyzer position_offset_gap #10934 (issue: #1812)
-
Expose Latvian analyzer #7542
-
Improve Hunspell error messages #6850
-
Share numeric date analyzer instances between mappings #6843
-
Add missing pre built analysis components #6770
-
PatternAnalyzer should use PatternTokenFilter instead [ISSUE] #6717
-
More resource efficient analysis wrapping usage #6714
-
Add additional Analyzers, Tokenizers, and TokenFilters from Lucene #6693 (issue: #5935)
-
Use non analyzed token stream optimization everywhere #6001
-
Add support for char filters in the analyze API #5148
-
- Bulk
- CAT API
-
-
Add option to _cat/indices to return index creation date #11688 (issue: #11524)
-
Mark shadow replicas with s in _cat/shards output #10023 (issue: #9772)
-
Add file descriptor details to cat/nodes #7655 (issue: #7652)
-
Add configured thread pool sizes to _cat/thread_pool [ISSUE] #5366
-
RestTable.renderValue() doesn’t know about tera and peta [ISSUE] #4871
- CRUD
- Cache
-
-
Left over from the query_cache to request_cache rename #12478
-
Give the filter cache a smaller maximum number of cached filters. #11833
-
Remove the query parser cache. #10856
-
Don’t use the fixed bitset filter cache for child nested level filters, but the regular filter cache instead #9740 (issue: #8810)
-
Use a smaller expected size when serializing query results #9485
-
Use correct number of bytes in query cache accounting #9479
-
Use a 1024 byte minimum weight for filter cache entries #8304 (issues: #8249, #8268)
-
Immediately remove filter cache entries on cache clear #8289 (issue: #8285)
-
Add hit and miss count to Query Cache #7355
-
Warmer (search) to support query cache #7326
-
Add a request level flag to control Query Cache #7167
-
Add a periodic cleanup thread for IndexFieldCache caches #7015 (issue: #7010)
-
- Circuit Breakers
-
-
Add support for registering custom circuit breaker #8795
-
Circuit Breakers: Log if CircuitBreaker is tripping #8050
-
Include name of the field that caused a circuit break in the log and exception message #5841 (issue: #5718)
-
Increase RamAccountingTermsEnum flush size from 1mb to 5mb #5335
-
Add circuit breaker for parent/child id cache #5325 (issue: #5325)
-
Add circuit breaker functionality to parent/child id field data cache [ISSUE] #5276
-
- Cluster
-
-
Remove double call to elect primaries #12147
-
Rename MetaData.uuid → MetaData.clusterUUID and IndexMetaData.uuid → IndexMetaData.indexUUID #11914 (issue: #11831)
-
Add MetaData.uuid to ClusterState.toXContent #11832
-
Remove scheduled routing #11776
-
Reset registeredNextDelaySetting on reroute #11759
-
Add Unassigned meta data #11653
-
Remove deprecated METADATA cluster block level #10779 (issue: #9203)
-
Make routing_nodes an independent metric option in cluster state api #10412 (issue: #10352)
-
Add support for cluster state diffs #10212
-
Add METADATA_READ and METADATA_WRITE blocks #9203 (issues: #10521, #10522, #2833, #3703, #5855, #5876, #8102)
-
Don’t mark cluster health as timed out if desired state is reached #8683
-
Add missing cluster blocks handling for master operations #7763 (issue: #7740)
-
Master election should demote nodes which try to join the cluster for the first time #7558 (issue: #7493)
-
Do not use a background thread to disconnect node which are removed from the ClusterState #7543
-
Refactored ClusterStateUpdateTask protection against execution on a non master #7511 (issue: #7493)
-
Remove unneeded cluster state serialization during cluster join #6949
-
Resend failed shard messages when receiving a cluster state still referring to the failed shards #6881
-
Send shard exists requests if shard exists locally but is not allocated to the node #6870
-
Don’t attempt to start or fail shard if no master node can be found #6841
-
Improve handling of failed primary replica handling #6825 (issue: #6808)
-
Add local node to cluster state #6811
-
During relocation, process pending mapping update in phase 2 #6762 (issue: #6648)
-
Improve pending api to include current executing class #6744
-
Start Master|Node fault detection pinging immediately during discovery #6706 (issue: #6480)
-
Clean shard bulk mapping update to only use type name #6695
-
Ensure index.version.created is consistent #6660
-
Refactored AckedClusterStateUpdateTask & co. to remove code repetitions in subclasses #6559
-
Wait till node is part of cluster state for join process #6480
-
Do not use versions to optimize cluster state copying for a first update from a new master #6466
-
Improve cluster update settings api #6244
-
When sending shard start/failed message due to a cluster state change, use the master indicated in the new state rather than current #6189
-
Raise node disconnected even if the transport is stopped #5918
-
Moved the updateMappingOnMaster logic into a single place. #5850 (issue: #5798)
-
A new ClusterStateStatus to indicate cluster state life cycles #5741
-
Optimize multiple cluster state processing on receiving nodes [ISSUE] #5139
-
Introduced a new IMMEDIATE priority - higher than URGENT #5098 (issue: #5062)
-
Bulk process of shard started/failed should not execute on already processed events [ISSUE] #5061
-
- Core
-
-
Improve jvmcheck error failure #12696
-
Use explicit flag if index should be created on engine creation #12671
-
Move Streams.copyTo(String|Bytes)FromClasspath() into StreamsUtils #12598
-
Improve toString on EsThreadPoolExecutor #12535 (issue: #9732)
-
Carry over shard exception failure to master node #12263
-
Allow IBM J9 2.8+ in version check #11850
-
Use System.nanoTime for ThreadPool’s estimated time, since it’s less likely to go backwards #11626
-
Cleanup MergeScheduler infrastructure #11602
-
Reduce shard inactivity timeout to 5m #11479 (issues: #11179, #11336)
-
Fail shard if search execution uncovers corruption #11440 (issue: #11419)
-
Acquire IndexWriter’s write.lock before shard deletion #11127 (issue: #11097)
-
Ban PathUtils.get (for now, until we fix the two remaining issues) #11069 (issues: #11065, #11068)
-
Refactor SSD/FileStore logic out of NodeEnvironment #10755 (issue: #10717)
-
Refactor TransportShardReplicationOperationAction #10749 (issue: #10032)
-
Make getFileStore a bit more defensive #10696
-
Ref count write operations on IndexShard #10610
-
Rename START phase into VERIFY_INDEX #10570
-
Refresh if many deletes in a row use up too much version map RAM #10312 (issue: #7052)
-
Add before/afterIndexShardDelete callbacks to index lifecycle #10173
-
Move GatewayShardsState logic into IndexShard #10093
-
Don’t rethrow already handled merge exceptions #10083
-
NodeEnv should lock all shards for an index #9799
-
Retry if shard deletes fail due to IOExceptions #9784
-
Only do a single listAll from FileSwitchDir #9666 (issue: #6636)
-
Consolidate index / shard deletion in IndicesService #9605
-
Increase default xlog flush size from 200mb to 512 mb #9341 (issue: #9265)
-
Pass through all exceptions in IndicesLifecycleListeners #9330
-
Pass index settings to IndicesLifecycle#beforeIndexCreated and #afterIndexShardClosed #9245
-
Delete shard content under lock #9083 (issues: #8608, #9009)
-
Remove IndexEngine #8955
-
Remove Gateway abstraction #8954
-
Use Lucene’s defaults for compound file format #8934 (issue: #8919)
-
Remove explicit .cleanUp() on cache clear #8924
-
Cleanup LocalGatewayShardsState #8852
-
Let Lucene kick off merges normally #8643
-
Cut over MetaDataStateFormat to Path API in Gateway #8609
-
Ensure shards are deleted under lock on close #8579
-
Add before/after indexDeleted callbacks to IndicesLifecycle #8569 (issue: #8551)
-
Free pending search contexts if index is closed #8551
-
Ban all usage of Future#cancel(true) #8494
-
Set bloom default to false even when Directory doesn’t have a codecService #8442
-
Introduce shard level locks to prevent concurrent shard modifications #8436
-
Observe cluster state on health request #8350
-
Remove usage of Directory#fileExists #8233
-
Introduce a RefCounted interface and basic impl #8210
-
Use 1 instead of 0 as filler version value for nested docs #8145
-
Resiliency: Perform write consistency check just before writing on the primary shard #7873
-
Add ActionRunnable support to ThreadPool to simplify async operation on bounded threadpools #7765
-
Change the default cache filter impl from FixedBitSet to WAH8DocIdSet #7577 (issues: #6280, #7037)
-
Verify checksums on merge #7360
-
Change numeric data types to use SORTED_NUMERIC docvalues type #6967
-
Disable loading of bloom filters by default #6959 (issues: #6298, #6349)
-
Don’t close/reopen IndexWriter when changing RAM buffer size #6856
-
Don’t acquire dirtyLock on autoid for create #6584
-
Reuse Lucene’s TermsEnum for faster _uid/version lookup during indexing #6298 (issue: #6212)
-
Entirely cut over to TopDocs#merge for merging shard docs in the reduce phase #6197
-
Don’t use AllTokenStream if no fields were boosted [ISSUE] #6187
-
Remove SerialMergeScheduler [ISSUE] #6120
-
Throttling incoming indexing when Lucene merges fall behind [ISSUE] #6066
-
Use Lucene built-in checksumming [ISSUE] #5924
-
Don’t lookup version for auto generated id and create #5917
-
Change default merge throttling to 50MB / sec #5902
-
Don’t lookup version for auto generated id and create #5785
-
Prevent fsync from creating 0-byte files #5746
-
Move to use serial merge schedule by default #5447
-
Force merges to not happen when indexing a doc / flush #5319
-
Reuse pages more aggressively in BigArrays. #5300 (issue: #5299)
-
- Dates
- Discovery
-
-
Wait on incoming joins before electing local node as master #12161
-
Don’t join master nodes or accept join requests of old and too new nodes #11972 (issue: #11924)
-
Prevent over allocation for multicast ping request #10896
-
Unicast Ping should close temporary connections after returning ping results #10849
-
Prevent stale master nodes from sharing dated cluster states to nodes that have moved to a different master node #9632
-
Publishing timeout to log at WARN and indicate pending nodes #9551
-
Concurrent node failures can cause unneeded cluster state publishing #8933 (issue: #8804)
-
Client: Only fetch the node info during node sampling #8685
-
Improve handling of multicast binding exceptions #8243 (issue: #8225)
-
Simplify discovery node initialization if version is unknown #8055 (issue: #8051)
-
Remove MasterFaultDetection.Listener.notListedOnMaster #7995
-
Only accept unicast pings once started #7950
-
Add a finalize round to multicast pinging #7924
-
During discovery, master fault detection should fall back to cluster state thread upon error #7908 (issue: #7834)
-
Close ping handler’s executor service properly #7903
-
NodesFD: simplify concurrency control to fully rely on a single map #7889
-
During discovery, remove any local state and use clusterService.state instead #7834
-
Update ZenDiscovery fields via the cluster service update task. #7790
-
Give a unique id to each ping response #7769
-
UnicastZenPing don’t rename configure host name #7747 (issue: #7719)
-
Node join requests should be handled at lower priority than master election #7733
-
Not all master election related cluster state update task use Priority.IMMEDIATE #7718
-
Accumulated improvements to ZenDiscovery #7493 (issue: #2488)
-
UnicastZenPing should also ping last known discoNodes #7336
-
With unicast discovery, only disconnect from temporary connected nodes #6966
-
During discovery, verify connect when sending a rejoin cluster request #6779
-
Have a dedicated join timeout that is higher than ping.timeout for join #6342
-
Unicast discovery enhancement #5508
-
- Engine
-
-
Pre sync flush cleanups #11252
-
Remove flushNeeded in favor of IW#hasUncommittedChanges() #11225
-
Remove the ability to flush without flushing the translog #11193
-
Make SearchFactory static class in InternalEngine #11154
-
Remove reflection call to waitForMerges #10102
-
Always fail engine on corruption #10092
-
Move InternalEngine.segmentStats() into abstract Engine #9728 (issue: #9727)
-
Move more methods into abstract Engine #9717
-
Move as much as possible into abstract Engine #9678
-
Factor out settings updates from Engine #9625
-
Close Engine immediately if a tragic event strikes. #9616 (issue: #9517)
-
Refactor InternalEngine into abstract Engine and classes #9585
-
Remove FlushType and make resources final in InternalEngine #9565
-
Remove dirty flag and force boolean for refresh #9484
-
Simplify Engine construction and ref counting #9211
-
Fold engine into IndexShard #9181
-
Don’t acquire Engine’s readLock in segmentsStats #8910 (issue: #8905)
-
Remove engine related command classes #8900
-
Allow InternalEngine to be stopped and started #8784 (issue: #8720)
-
Flush IndexWriter to disk on close and shutdown #7563
-
Ensure close is called under lock in the case of an engine failure #5800
-
Fail the engine/shard when refresh failed #5633
-
- Exceptions
-
-
Improve startup exceptions (especially file permissions etc) #13050
-
Fix formatting of startup/configuration errors #13029
-
Add serialization support for InterruptedException #12981
-
Include stacktrace in rendered exceptions #12260 (issue: #12239)
-
Render structured exceptions in mget / mpercolate #12240
-
Add index name to the upgrade exception #12213
-
Promote headers to first class citizens on exceptions #12006
-
Parameterized exception messages #11981
-
Carry on rest status if exceptions are not serializable #11973
-
Render structured exception in multi search #11849
-
Reduce the size of the XContent parsing exception #11642
-
Remove ElasticsearchIAE and ElasticsearchISE #10862 (issue: #10794)
-
Improve exception handling in transport local execution #10554
-
Fix typo when primary is not available to index a document (UnavailableShardsException) #10140
-
Change IndexPrimaryShardNotAllocatedException from 409 to 500 #7987 (issue: #7632)
-
Nest original exception while creating NoShardAvailableActionException #7757 (issue: #7756)
-
Improve exception from Store.failIfCorrupted #7695 (issue: #7596)
-
Introduced a new elasticsearch exception family that can hold headers #7269
-
Better message for invalid internal transport message format #6916
-
Function Score: Add missing whitespace in error message when throwing exception #6155
-
- Fielddata
-
-
Consult field info before fetching field data #12403
-
Remove the dependency on IndexFielddataService from MapperService. #12371
-
Enable doc values by default, when appropriate #10209 (issue: #8312)
-
Change threshold value of fielddata.filter.frequency.max/min #9522 (issue: #9327)
-
Fielddata: Remove custom comparators and use Lucene’s instead #6981 (issue: #5980)
-
Switch fielddata to use Lucene doc values APIs. #6908
-
Make BytesValues.WithOrdinals more similar to Lucene’s SortedSetDocValues #6524
-
Don’t expose hashes in Fielddata anymore. #6500
-
Add a dedicated field data type for the _index field mapper. #6073 (issue: #5848)
-
Provide meaningful error message if field has no fielddata type #5979 (issue: #5930)
-
Use segment ordinals as global ordinals if possible #5873 (issue: #5854)
-
Make use of global ordinals in parent/child queries #5846
-
Added a AppendingDeltaPackedLongBuffer-based storage format to single value field data #5706
-
Remove AtomicFieldData.isValuesOrdered. #5688
-
Add global ordinals #5672
-
Moved the decision to load p/c fielddata eagerly to a better place. #5569
-
- Geo
-
-
Update ShapeBuilder and GeoPolygonQueryParser to accept non-closed GeoJSON #11161 (issue: #11131)
-
Remove local Lucene Spatial package #10966
-
Add merge conflicts to GeoShapeFieldMapper #10533 (issues: #10513, #10514)
-
Coordinates can contain more than two elements (x,y) in GeoJSON parser. #9542 (issue: #9540)
-
Revert "[GEO] Update GeoPolygonFilter to handle ambiguous polygons" #9463 (issues: #5968, #9304, #9339, #9462)
-
Update GeoPolygonFilter to handle polygons crossing the dateline #9339 (issues: #5968, #8672, #9304)
-
GeoPolygonFilter not properly handling dateline and pole crossing #9171 (issue: #5968)
-
Removing unnecessary orientation enumerators #9036 (issues: #8978, #9035)
-
Add optional left/right parameter to GeoJSON #8978 (issue: #8764)
-
Feature/Fix for OGC compliant polygons failing with ambiguity #8762 (issue: #8672)
-
Fixes BoundingBox across complete longitudinal range #7340 (issue: #5218)
-
Adds support for GeoJSON GeometryCollection #7123 (issue: #2796)
-
Added caching support to geohash_filter #6478
-
Allow to parse lat/lon as strings and coerce them #5626
-
Improve error detection in geo_filter parsing #5371 (issue: #5370)
-
Improve geo distance accuracy [ISSUE] #5192
-
Add support for distances in nautical miles #5088 (issue: #5085)
-
- Highlighting
- Index APIs
-
-
Show human readable Elasticsearch version that created index and date when index was created #11509 (issue: #11484)
-
Add check to MetaData#concreteIndices to prevent NPE #10342 (issue: #10339)
-
Set maximum index name length to 255 bytes #8158 (issue: #8079)
-
Add wait_if_ongoing option to _flush requests #6996
-
Unified MetaData#concreteIndices methods into a single method that accepts indices (or aliases) and indices options #6169
-
Fix detection of unsupported fields with validate API #5782 (issue: #5685)
-
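The wait_if_ongoing flush option listed above (#6996) tells a flush request to wait for an in-flight flush instead of failing. A minimal sketch of how a client might build such a request URL; the host, port, and index name are assumptions for illustration, and no request is actually sent:

```python
# Illustrative sketch only: build the URL for a _flush request using the
# wait_if_ongoing option (#6996). Host/port and index name are made up.
from urllib.parse import urlencode

def flush_url(index, wait_if_ongoing=False, host="localhost", port=9200):
    """Return the _flush endpoint URL, optionally asking the node to wait
    for any flush already in progress rather than failing immediately."""
    base = f"http://{host}:{port}/{index}/_flush"
    if wait_if_ongoing:
        return base + "?" + urlencode({"wait_if_ongoing": "true"})
    return base

print(flush_url("products", wait_if_ongoing=True))
```

A POST to the returned URL would then perform the flush with the chosen behavior.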
- Index Templates
- Indexed Scripts/Templates
-
-
Indexed scripts/templates: return response body when script is not found #10396 (issue: #7325)
-
Make sure headers are handed over to internal requests and streamline versioning support #7569
-
Use preference("_local") on get calls. #7477
-
Indexed Scripts/Templates: Return error message on 404 #7335 (issue: #7325)
-
- Inner Hits
- Internal
-
-
Drop commons-lang dependency #12972
-
Flatten ClusterModule and add more tests #12916 (issue: #12783)
-
Allow a plugin to supply its own query cache implementation #12881
-
Remove Environment.resolveConfig #12872
-
Remove ClassLoader from Settings #12868
-
Transport: allow to de-serialize arbitrary objects given their name #12571
-
Add RealtimeRequest marker interface to group realtime operations together #12537
-
Remove unused QueryParseContext argument in MappedFieldType#rangeQuery() #12417
-
Simplify Replica Allocator #12401
-
Replace primaryPostAllocated flag and use UnassignedInfo #12374
-
Add the ability to wrap an index searcher. #12364
-
Cleanup TransportSingleShardAction and TransportInstanceSingleOperationAction #12361
-
Remove TransportSingleCustomOperationAction in favour of TransportSingleShardAction #12350
-
updated the elasticsearch versioning format #12210
-
Cleanup the data structures used in MetaData class for alias and index lookups #12202
-
Make 2.0.0.beta1-SNAPSHOT the current version. #12151 (issue: #12148)
-
Remove mapper references from Engines #12130
-
Cleanup ShardRoutingState uses and hide implementation details of ClusterInfo #12126
-
Consolidate ShardRouting construction #12125
-
Change JarHell to operate on Path instead of URL #12109
-
Refactor MetaData to split off the concrete index name logic to a dedicated service #12058
-
really ban exitVM with security policy #11982
-
Cut over to writeable for TransportAddress #11949
-
Internal: make sure ParseField is always used in combination with parse flags #11859
-
Remove XContentParser.map[Ordered]AndClose(). #11846
-
Remove reroute with no reassign #11804
-
Use abstract runnable in scheduled ping #11795
-
Mark store as corrupted instead of deleting state file on engine failure #11769
-
Add DateTime ctors without timezone to forbidden APIs #11743
-
Fold ShardGetService creation away from Guice into IndexShard #11606
-
Create ShardSuggestService/Metrics manually outside of guice #11605
-
Minimize the usage of guava classes in interfaces, return types, arguments #11501
-
Make CompressedXContent.equals fast again. #11428 (issue: #11247)
-
Consolidate shard level modules without logic into IndexShardModule #11416
-
Serialization: Remove old version checks #11397
-
Catch UnsatisfiedLinkError on JNA load #11385
-
Deduplicate field names returned by simpleMatchToFullName & simpleMatchToIndexNames in FieldMappersLookup #11377 (issue: #10916)
-
Rename TransportShardReplicationOperationAction to TransportReplicationAction #11332
-
Absorb ImmutableSettings into Settings #11321 (issue: #7633)
-
Make some sec mgr / bootup classes package private and final. #11312
-
Tighten up our compression framework. #11279
-
Uid#createTypeUids to accept a collection of ids rather than a list #11263
-
Remove need for generics in ContextAndHeaderHolder #11222
-
Remove dependency on hppc:esoteric. #11144
-
Improve path mgmt on init, better error messages, symlink support #11106
-
Ensure JNA is fully loaded when it’s available, but don’t fail if it’s not #10989
-
Transport: read/writeGenericValue to support BytesRef #10878
-
Remove Preconditions class #10873
-
Remove index/indices replication infra code #10861
-
Wait forever (or one day) for indices to close #10833 (issue: #10680)
-
Reduce code duplication in TransportIndexAction/TransportShardBulkAction. #10819
-
Don’t create a new BigArrays instance for every call of withCircuitBreaking #10800 (issue: #10798)
-
Change BigArrays to not extend AbstractComponent #10798
-
Use Tuple only as return value in Bootstrap #10784
-
CommitStats doesn’t need to allow for null values in commit user data #10774 (issue: #10687)
-
Prevent injection of unannotated dynamic settings #10763 (issue: #10614)
-
Refactor and cleanup transport request handling #10730
-
Add fairness option to KeyedLock. #10703
-
Cleanup local code transport execution #10582
-
Make APIs work per-segment like Lucene’s Collector. #10389
-
Fix string comparisons #10204
-
Remove unsafe field in BytesStreamInput #10157
-
Make assert less strict to ensure local node is not null #10076
-
Use provided cluster state for indices service validations #10014
-
Fix Java 8 _ variable warning #10013
-
Stop passing default cluster name to cluster state read operations #9888
-
Add missing @Override annotations. #9832
-
Some more simple fs cleanups. #9827
-
Fix errors reported by error-prone #9817
-
Remove redundant fs metadata ops. #9807
-
Remove XCollector. #9677
-
Introduce TimedPrioritizedRunnable base class to all commands that go into InternalClusterService.updateTasksExecutor #9671 (issues: #8077, #9354)
-
Search: Reuse Lucene’s MultiCollector. #9549
-
Add beforeIndexAddedToCluster callback #9514
-
Remove HandlesStreamInput/Output #9486
-
Add AliasesRequest interface to mark requests that manage aliases #9460
-
ClusterInfoService should wipe local cache upon unknown exceptions #9449
-
Minor hygiene, Removed Redundant inheritance #9427
-
Clean up memory reuse a bit. #9272
-
Remove includeExisting flag from adding ObjectMapper and FieldMapper listeners #9184
-
Reduce the size of the search thread pool. #9165 (issue: #9135)
-
Assert that we do not call blocking code from transport threads #9164
-
Remove reduced stack size and use JVM default instead #9158 (issue: #9135)
-
Remove IndexCloseListener & Store.OnCloseListener #9009 (issues: #8436, #8608)
-
Remove circular dependency between IndicesService and IndicesStore #8918
-
Remove some Internal* abstractions #8904
-
Add File.java to forbidden APIs #8666
-
Inverse DocIdSets' heuristic to find out fast DocIdSets. #8380
-
Temporarily ban buggy IOUtils methods with forbidden #8375
-
Refactor shard recovery from anonymous class to ShardRecoveryHandler #8363
-
Make indexQueryParserService available from ParseContext #8252 (issue: #8248)
-
Allow to configure custom thread pools #8247
-
Expose concurrency_level setting on all caches #8112 (issue: #7836)
-
Resiliency: Be more conservative if index.version.created is not set #8018
-
Split internal fetch request used within scroll and search #7870 (issues: #6933, #7319, #7856)
-
Never send requests after transport service is stopped #7862
-
Split internal free context request used after scroll and search #7856 (issues: #6933, #7319)
-
Clarify when a shard search request gets created to be only used locally #7855
-
Add a listener thread pool #7837
-
Remove unused ForceSyncDirectory #7804
-
Force execution of delete index requests #7799
-
Check if from + size don’t cause overflow and fail with a better error #7778 (issue: #7774)
-
Make sure that internally generated percolate request re-uses the original headers and request context #7767
-
Make sure that update internal requests share the same original headers and request context #7766
-
Make sure that all delete mapping internal requests share the same original headers and context #7736
-
Added scrollId/s setters to the different scroll requests/responses #7722
-
Make sure that original headers are used when executing search as part of put warmer #7711
-
Refactor copy headers mechanism to not require a client factory #7675 (issue: #7594)
-
In thread pools, use DirectExecutor instead of deprecated API #7636
-
Change LZFCompressedStreamOutput to use buffer recycler when allocating encoder #7613
-
Introduced a transient context to the rest request #7610
-
Refactor copy headers mechanism in REST API #7594 (issue: #6513)
-
Deduplicate useful headers that get copied from REST to transport layer #7590
-
Remove DocSetCache. #7582
-
Extract a common base class for (Master|Nodes)FaultDetection #7512 (issue: #7493)
-
Removing useless methods and method parameters from ObjectMapper.java and TypeParsers.java #7474 (issue: #7271)
-
Extended ActionFilter to also enable filtering the response side #7465
-
Move index templates api back to indices category and make put template and create index implement IndicesRequest #7378
-
Make sure that multi_search request hands over its context and headers to its corresponding search requests #7374
-
Make sure that multi_percolate request hands over its context and headers to its corresponding shard requests #7371
-
Clarify XContentParser/Builder interface for binary vs. utf8 values #7367
-
Remove CacheRecycler. #7366
-
Get request while percolating existing documents to keep around headers and context of the original percolate request #7333
-
Auto create index to keep around headers and context of the request that caused it #7331
-
Switch to fixed thread pool by default for management threads #7320 (issue: #7318)
-
Make sure that all shard level requests hold the original indices #7319
-
Refactored TransportMessage context #7303
-
Made it possible to disable the main transport handler in TransportShardSingleOperationAction #7285
-
Adjusted BroadcastShardOperationResponse subclasses visibility #7255
-
Add some @Nullable annotations and fix related compilation warnings. #7251
-
Adjusted visibility for BroadcastShardOperationRequest subclasses and their constructors #7235
-
Changed every single index operation to not replace the index within the original request #7223
-
Adjusted TermVectorRequest serialization to not serialize and de-serialize the index twice #7221
-
Refactored TransportSingleCustomOperationAction, subclasses and requests #7214
-
Removed needless serialization code from TransportIndexReplicationAction and corresponding request object #7211
-
Added transient header support for TransportMessage #7187
-
Check for null references that may be returned due to concurrent changes or inconsistent cluster state #7181
-
Better categorization for transport actions #7105
-
Added a cli infrastructure #7094
-
Introduced the notion of a FixedBitSetFilter that guarantees to produce a FixedBitSet #7037 (issue: #7031)
-
Remove use of recycled set in filters eviction #7012
-
Refactor TransportActions #6989
-
Expose the indices names in every action relates to if applicable #6933
-
Rename FieldMapper.termsFilter to fieldDataTermsFilter. #6888
-
Make XContentBuilder Releasable #6869
-
Remove (mostly) unused failure member from ShardSearchFailure. #6861 (issue: #6837)
-
Use KeyedLock in IndexFieldDataService #6855
-
Cleanup of the transport request/response messages #6834
-
Don’t replace indices within ActionRequest and check blocks against concrete indices #6777 (issues: #1, #2)
-
Separate parsing implementation from setter in SearchParseElement #6758 (issue: #3602)
-
Remove intern calls on FieldMapper#Names for better performance #6747
-
Disable explicit GC by default #6637
-
Make sure we don’t reuse arrays when sending an error back #6631
-
Wrap RateLimiter rather than copy RateLimitedIndexOutput #6625
-
Re-shade MVEL as a dependency #6570
-
Copy the headers from REST requests to the corresponding TransportRequest(s) #6513 (issue: #6464)
-
Better default size for global index → alias map #6504
-
Use ConcurrentHashMapV8 for lower memory overhead #6400
-
Made base64 decode parsing to detect more errors #6348 (issue: #6334)
-
Change the default type of the page recycler to CONCURRENT instead of SOFT_CONCURRENT. #6320
-
Some minor cleanups #6210
-
Remove SoftReferences from StreamInput/StreamOutput #6208
-
Use t-digest as a dependency. #6142
-
Add support for Byte and BytesRef to the XContentBuilder #6127
-
Remove unused dump infra #6060
-
Made it mandatory to specify IndicesOptions when calling MetaData#concreteIndices #6059
-
Limit the number of bytes that can be allocated to process requests. #6050
-
Fix code typo in FieldSortBuilder.java #5937
-
Improved bloom filter hashing #5901
-
Field data diet. #5874
-
Cleanup FileSystemUtils #5806
-
Make writePrimitive*() and readPrimitive*() methods public. #5710
-
LongHash add/key not consistent [ISSUE] #5693
-
Releasable bytes output + use in transport / translog #5691
-
Make Releasable extend AutoCloseable. #5689
-
Replaces usage of StringBuffer with StringBuilder #5606 (issue: #5605)
-
Internally manipulate the terms execution hint as an enum instead of a String. #5530
-
Let ByteArray/BigByteArray.get() indicate whether a byte[] was materialized. #5529
-
BytesReference.Helper should never materialize a byte[] array. #5517
-
Clean the query parse context after usage #5475
-
BytesReference usage to properly work when hasArray is not available #5455
-
MulticastChannel returned wrong channel in shared mode #5441
-
New class PagedBytesReference: BytesReference over pages #5427 (issue: #5420)
-
Rewrite BytesStreamOutput on top of BigArrays/ByteArray. #5331 (issue: #5159)
-
Add tracking of allocated arrays. #5264
-
Remove thread local recycler #5254
-
Recycler: better lifecycle control for pooled instances #5217 (issue: #5214)
-
Remove useless URL instantiation #5206
-
Variable renamings to reduce unnecessary variable naming diversity #5075
-
Add RamUsageEstimator#sizeOf(Object) to forbidden APIs #4975
-
Remove redundant version checks in transport serialisation #4731
- Java API
-
-
PrefixQueryParser takes a String as value like its Builder #12204 (issue: #12032)
-
Fix FuzzyQuery to properly handle Object, number, dates or String. #12020 (issue: #11865)
-
Treat path object as a simple value instead of Iterable in XContentBuilder #11903 (issue: #11771)
-
IdsQueryBuilder: Allow to add a list in addition to array #11409 (issue: #5089)
-
Fix typed parameters in IndexRequestBuilder and CreateIndexRequestBuilder #11382 (issue: #10825)
-
Unify SearchResponse and BroadcastOperationResponse code around shards header #11064
-
Remove duplicated buildAsBytes and corresponding toString methods #11063
-
Remove duplicated consistency level and replication type setters #10188
-
Package private getters to become public if they have corresponding public setters #9273
-
Add internal liveness action to transport client #8763
-
Added utility method #8594
-
Enabled overriding the request headers in the clients #8258
-
Adding setters or making them public in ActionRequests #8123 (issue: #8122)
-
Add indices setter to IndicesRequest interface #7734
-
Mark transport client as such when instantiating #7552
-
Allow nullable queryBuilder in FilteredQueryBuilder to match rest api #7398 (issue: #7365)
-
Some PercolateRequest "setters" allow for method chaining, some don’t [ISSUE] #7294
-
Throw IllegalStateException if you try to .addMapping for same type more than once #7243 (issue: #7231)
-
XContentBuilder.map(Map) method modified to use a wildcard for value’s type. #7212
-
Add suggestRequest to Requests and fix broken javadocs in client #7207 (issue: #7206)
-
Add index, type and id to ExplainResponse #7201
-
Add a blocking variant of close() method to BulkProcessor #6586 (issues: #4158, #6314)
-
Client intermediate interfaces removal follow-up #6563 (issue: #6517)
-
TransportClient: Improve logging, fix minor issue #6376
-
Add BoolFilterBuilder#hasClauses to be consistent with BoolQueryBuilder #5476 (issue: #5472)
-
Allow iteration over MultiGetRequest#Item instances #5470 (issue: #3061)
-
Java API does not have a way to set global highlighting settings [ISSUE] #5281
-
- Logging
-
-
Adds a setting to control source output in indexing slowlog #12806 (issue: #4485)
-
Add more debugging information to the Awareness Decider #12490 (issue: #12431)
-
Add shadow indicator when using shadow replicas #12399
-
Log warn message if leftover shard is detected #11826
-
Add -XX:+PrintGCDateStamps when using GC Logs #11735 (issue: #11733)
-
ClusterStateObserver should log on trace on timeout #11722
-
Better error messages when mlockall fails #11433
-
Display low disk watermark to be consistent with documentation #11313 (issue: #10588)
-
Add index name to log statements when settings update fails #11124
-
Add logging of slow cluster state tasks #10907 (issue: #10874)
-
Logging: add the ability to specify an alternate logging configuration location #10852 (issues: #2044, #7395)
-
Log sending translog operation batches to nodes #10544
-
Log only a summary line of filesystem detail for all path.data on node startup #10527 (issue: #10502)
-
Add INFO logging saying whether each path.data is on an SSD #10502
-
Use static logger name in Engine.java #10497
-
Miscellaneous additional logging and cleanups #10376
-
Fix logging a RoutingNode object, log an object with a good .toString instead #9863
-
Logging: improve logging messages added in #9562 #9603 (issue: #9562)
-
Change logging to warning to match pattern #9593
-
Add logging around gateway shard allocation #9562
-
Added a simple request tracer, logging incoming and outgoing Transport requests #9286
-
Reduce apache (cloud-aws) logging when rootLogger is DEBUG #8856
-
Clarify index removal log message #8641
-
Log how long IW.rollback took, and when MockFSDir starts its check index #8388
-
Change log level for mpercolate #8306
-
Suppress long mapping logging during mapping updates (unless in TRACE) #7949
-
Bootstrap: Log startup exception to console if needed and to file as error #6581
-
Log script change/add and removal at INFO level #6104
-
Include thread name when logging IndexWriter’s infoStream messages #5973
-
Tie in IndexWriter’s infoStream output to "lucene.iw" logger with level=TRACE #5934 (issue: #5891)
-
Be less verbose logging ClusterInfoUpdateJob failures #5222
-
Add EnhancedPatternLayout to logging.yml options #4991
-
- Mapping
-
-
Move the _size mapper to a plugin. #12582
-
Remove index name from mapping parser #12352
-
Remove AbstractFieldMapper #12089
-
Completely move doc values and fielddata settings to field types #12014
-
Move short name access out of field type #11977
-
Rename "root" mappers to "metadata" mappers #11962
-
Mappings: Remove close() from Mapper #11863
-
Move merge simulation of fieldtype settings to fieldtype method #11783 (issue: #8871)
-
Hide more fieldType access and cleanup null_value merging #11770 (issue: #11764)
-
Replace fieldType access in mappers with getter #11764
-
Remove SmartNameObjectMapper #11686
-
Add equals/hashcode to fieldtypes #11644
-
Shortcut exists and missing queries when no types/docs exist #11586
-
Remove leftover sugar methods from FieldMapper #11565
-
Make index level mapping apis use MappedFieldType #11559
-
Move null value handling into MappedFieldType #11544
-
Refactor core index/query time properties into FieldType #11422 (issue: #8871)
-
Validate parsed document does not have trailing garbage that is invalid json #11414 (issue: #2315)
-
Remove generics from FieldMapper #11292
-
Cleanup field name handling #11272
-
Remove document parse listener #11243
-
Remove SmartNameFieldMappers #11216
-
Make DocumentMapper.refreshSource() private. #11209
-
Make mapping updates atomic wrt document parsing. #11205
-
Remove the ignore_conflicts option. #11203
-
Add back support for enabled/includes/excludes in _source field #11171 (issue: #11116)
-
Make FieldNameAnalyzer less lenient. #11141
-
Remove mapper listeners #11045
-
Remove traverse functions from Mapper #11027
-
Consolidate document parsing logic #10802
-
Wait for required mappings to be available on the replica before indexing. #10786
-
Join MergeResults with MergeContext since they are almost the same #10765
-
Restrict murmur3 field type to sane options #10738 (issue: #10465)
-
Remove dead code after previous refactorings #10666 (issue: #8877)
-
Same code path for dynamic mappings updates and updates coming from the API. #10593 (issues: #8688, #9364, #9851)
-
Add enabled flag for _field_names to replace disabling through index: no #9893
Fix field mappers to always pass through index settings #9780
-
Add ignore_missing option to timestamp #9104 (issues: #8882, #9049)
-
Include currentFieldName into ObjectMapper errors #9020
-
Store _timestamp by default. #8139
-
Make lookup structures immutable. #7486
-
Report conflict when merging _all field mapping and throw exception when doc_values specified #7377 (issue: #777)
-
Enforce non-null settings. #7032
-
Control whether MapperService docMapper iterator should contain DEFAULT_MAPPING #6793
-
Call callback on actual mapping processed #6748
-
Improve performance for many new fields introduction in mapping #6707
-
Better logic on sending mapping update new type introduction #6669
-
Wait for mapping updates during local recovery #6666
-
Check if root mapping is actually valid #6093 (issues: #4483, #5864)
-
Support empty properties array in mappings #6006 (issue: #5887)
-
Update default precision step, modulo tests #5908 (issue: #5905)
-
Support externalValue() in mappers #4986
-
Norms disabling on existing fields [ISSUE] #4813
-
- More Like This
-
-
Renamed ignore_like to unlike #11117
-
Lenient default parameters #9412
-
Remove MLT Field Query #8238
-
Support for when all fields are deprecated #8067
-
Add versatile like parameter #8039
-
Replace percent_terms_to_match with minimum_should_match #7898
-
Default to all possible fields for items #7382
-
Switch to using the multi-termvectors API #7014
-
Fetch text from all possible fields if none are specified #6740
-
Ensure selection of best terms is indeed O(n) #6657
-
Create only one MLT query per field for all queried items #6404
-
Add the ability to specify the analyzer used for each Field #6329
-
Added syntax for single item specification. #6311
-
Values of a multi-value fields are compared at the same level #6310
-
Replaced exclude with include to avoid double negation #6248
-
Allow for both like_text and docs/ids to be specified. #6246
-
Added the ability to include the queried document for More Like This API. #6067
-
Fix behavior on default boost factor for More Like This. #6021
-
Added searching for multiple similar documents #5857 (issue: #4075)
-
- NOT CLASSIFIED
- Nested Docs
- Network
-
-
Don’t print lots of noise on IPv4 only hosts. #13026
-
Remove support for address resolving in InetSocketTransportAddress #13020 (issue: #13014)
-
Log network configuration at debug level #12979
-
Use preferIPv6Addresses for sort order, not preferIPv4Stack #12951
-
Make sure messages are fully read even in case of EOS markers. #11768 (issue: #11748)
-
Default value for socket reuse should not be null #11255
-
Make Netty exceptionCaught method protected #10464
-
Remove content thread safe from REST layer #10429
-
Add getter for channel in NettyTransportChannel #10319
-
Schedule transport ping interval #10189
-
Return useful error message on potential HTTP connect to Transport port #10108 (issue: #2139)
-
Change access modifiers to protected in Netty HTTP Transport #9724
-
Add profiles to Netty transport infos #9134
-
Support binding on multiple host/port pairs #8098
-
Chunk direct buffer usage by networking layer #7811
-
Make sure channel closing never happens on i/o thread #7726
-
Support "default" for tcpNoDelay and tcpKeepAlive #7136 (issue: #7115)
-
Refactoring to make MessageChannelHandler extensible #6915 (issue: #6889)
-
Refactoring to make Netty MessageChannelHandler extensible #6889
-
Improve large bytes request handling by detecting content composite buffer #6756
-
Use loopback when localhost is not resolved #5719
-
- Packaging
-
-
Bats testing: Remove useless systemctl check #12724 (issue: #12682)
-
improve sanity of securitymanager file permissions #12609
-
Do not kill process on service shutdown #12298 (issue: #11248)
-
fail plugins on version mismatch #12221
-
Allow use of bouncycastle #12102
-
Give a better exception when a jar contains same classfile twice. #12093
-
Don’t jarhell check system jars #11979
-
detect jar hell before installing a plugin #11963 (issue: #11946)
-
jar hell check should fail, if jars require higher java version #11936
-
Load plugins into classpath in bootstrap #11918 (issue: #11917)
-
steps to remove dangerous security permissions #11898
-
Packaging: Add LICENSE and NOTICE files for all core dependencies #11705 (issues: #10684, #2794)
-
Export hostname as environment variable for plugin manager #11399 (issues: #10902, #9474)
-
Use our provided JNA library, versus one installed on the system #11163
-
Remove unnecessary permissions. #11132
-
Tighten up script security more #10999
-
Add pid file to Environment #10986
-
Bail if ES is run as root #10970
-
Remove exitVM permissions #10963
-
Remove JNI permissions, improve JNI testing. #10962
-
Remove shutdownHooks permission #10953
-
Simplify securitymanager init #10936
-
Exclude jackson-databind dependency #10924
-
Remove reflection permission for sun.management. #10848 (issue: #10553)
-
Security manager cleanups #10844
-
Add common SystemD file for RPM/DEB package #10725
-
Enable securitymanager #10717
-
Remove working directory #10672
-
Standardization of packages structure and install #10595 (issue: #10330)
-
Add properties files to configure startup and installation scripts #10330
-
Use direct mapping call in Kernel32Library #9923 (issue: #9802)
-
service.bat file should explicitly use the Windows find command. #9532
-
CliTool: Add command to warn on permission/owner change #9508
-
Export the hostname as environment variable #9474 (issue: #8470)
-
Windows: makes elasticsearch.bat more friendly to automated processes #9160 (issue: #8913)
-
Shutdown: Add support for Ctrl-Close event on Windows platforms to grace… #8993
-
Packaging: Add java7/8 java-package paths to init script #8815 (issue: #7383)
-
Check if proc file exists before calling sysctl #8793 (issue: #4978)
-
Factor out PID file creation and add tests #8775 (issue: #8771)
-
deb: add systemd service config for upcoming Jessie #8765 (issue: #8493)
-
bin/elasticsearch: add help, fix endless loop #8729 (issues: #2168, #7104)
-
Allow configuration of the GC log file via an environment variable #8479 (issues: #8471, #8479)
-
Introduce elasticsearch.in.bat (i.e. es.in for Windows) #8244 (issue: #8237)
-
Make .zip and .tar.gz release artifacts contain same files #7578 (issue: #2793)
-
Add default oracle jdk 7 (x64) path to JDK_DIRS #7132
-
Prevent init script from returning when the service isn’t actually started #6909
-
Windows: Modify command window title (windows) #6752 (issue: #6336)
-
Remove java-6 directories from debian init script #6350
-
Reset locale to C in bin/elasticsearch #6047
-
Remove spaces from commented config lines in elasticsearch.yml and logging.yml [ISSUE] #5842
-
Use the new command line syntax in the init script #5033
-
Startup: Add ES_HOME to ES_INCLUDE search path #4958
-
Mark lucene-expression as provided in pom.xml #4861 (issues: #4858, #4859)
-
Move systemd files from /etc to /usr/lib #4029
-
- Parent/Child
-
-
Enforce _parent field resolution to be strict #9521 (issue: #9461)
-
Reduce memory usage in top children query #8165
-
Adding min score mode to parent-child queries #7771 (issue: #7603)
-
Support min_children and max_children on has_child query/filter [ISSUE] #6019
-
Fix P/C assertions for rewrite reader #5731
-
Migrated p/c queries from id cache to field data. #4878 (issue: #4930)
-
- Percolator
-
-
Don’t cache percolator query on loading percolators #12862
-
Change percolator.getTime → percolator.time #11954
-
The query parse context should be fetched from the IndexQueryParseService #11929
-
Introduce index option named index.percolator.map_unmapped_fields_as_string #9054 (issues: #9025, #9053)
-
Remove index.percolator.allow_unmapped_fields setting. #8439
-
Percolator should cache index field data instances. #7081 (issue: #6806)
-
Reuse IndexFieldData instances between percolator queries #6845 (issue: #6806)
-
Add MemoryIndex reuse when percolating doc with nested type #5332
-
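The percolator entries above all concern Elasticsearch's reverse search: queries are registered first, and incoming documents are matched against them (the price-alert use case from the introduction). A toy sketch of that idea in plain Python; the registry and its single "field below threshold" rule are illustrative, not the actual percolator implementation:

```python
# Illustrative sketch of the percolator ("reverse search") idea: register
# queries up front, then match each incoming document against them. This
# toy registry supports only a "field value below X" rule.
registered_queries = {}

def register_query(query_id, field, below):
    # Store the rule under an id, like indexing a percolator query.
    registered_queries[query_id] = (field, below)

def percolate(doc):
    """Return the ids of all registered queries the document matches."""
    return [qid for qid, (field, below) in registered_queries.items()
            if field in doc and doc[field] < below]

register_query("cheap-gadget-alert", "price", 100)
print(percolate({"price": 95}))  # the alert query matches this document
```

Optimizations in the entries above (caching field data, reusing the MemoryIndex) speed up exactly this match-one-document-against-many-queries step.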
- Plugin Cloud AWS
- Plugin Cloud GCE
-
-
Update to GCE API v1-rev71-1.20.0 [ISSUE] #12835
-
- Plugins
-
-
Lucene SPI support for plugins. #13051
-
Ensure additionalSettings() do not conflict #12967
-
Validate checksums for plugins if available #12888 (issue: #12750)
-
Expose zen ElectMasterService as a Discovery extension point #12828
-
Introduce a formal ExtensionPoint class to streamline extensions #12826
-
Flatten Allocation modules and add back ability to plugin ShardsAllocators #12818 (issue: #12781)
-
Apply additional plugin settings only if settings are not explicit #12796
-
PluginManager: Do not try other URLs if specific URL was passed #12766
-
PluginManager: Fix elastic.co download URLs, add snapshot ones #12641 (issue: #12632)
-
Fix plugin script to allow spaces in ES_HOME #12610 (issue: #12504)
-
PluginManager: Add Support for basic auth #12445
-
Ensure logging configuration is loaded in plugin manager #12081 (issue: #12064)
-
Simplify Plugin Manager for official plugins #11805
-
Allow security rule for advanced SSL configuration #11751
-
Use of default CONF_DIR/CONF_FILE in plugin install #10721 (issues: #10673, #7946)
-
Always send current ES version when downloading plugins #10131
-
FileSystemUtils: Only create backup copies if files differ #9592
-
Add executable flag to every file in bin/ after install #7177
-
Lucene version checker should use `Lucene.parseVersionLenient` #7056
-
Introduced pluggable filter chain to be able to filter transport actions execution #6921
-
Enables plugins to define default logging configuration for their needs. #6805 (issue: #6802)
-
`bin/plugin` tests for missing plugin name when passing `--url` #6013 (issues: #5976, #5977)
-
Check plugin Lucene version #4984
-
Serving _site plugins do not pick up on index.html for sub directories #4850 (issue: #4845)
-
- Query DSL
-
-
Remove attempted (not working) support for array in not query parser #12890
-
simple query string: remove (not working) support for alternate formats #12798 (issue: #12794)
-
RegexpQueryParser takes a String as value like its Builder #12200 (issue: #11896)
-
Expose Lucene’s new TopTermsBlendedFreqScoringRewrite. #12129
-
Special case the `_index` field in queries #12027 (issue: #3316)
-
Add support for query boost to SimpleQueryStringBuilder. #11696 (issue: #11274)
-
Change geo filters into queries #11137
-
Make the script filter a query. #11126
-
Function score: Add `default` to `field_value_factor` #10845 (issue: #10841)
-
Return positions of parse errors found in JSON #10837 (issue: #3303)
-
Enable Lucene ranking behaviour for numeric term queries #10790 (issue: #10628)
-
Add support for `minimum_should_match` to `simple_query_string` #9864 (issue: #6449)
-
Raise an exception on an array of values being sent as the factor for a `field_value_factor` query #9246 (issue: #7408)
-
function_score: use query and filter together #8675 (issue: #8638)
-
Add option for analyzing wildcard/prefix to simple_query_string #8422 (issue: #787)
-
Expose `max_determinized_states` in regexp query, filter #8384 (issue: #8357)
-
FunctionScore: RandomScoreFunction now accepts longs as well as strings. #8311 (issue: #8267)
-
Function Score: Add optional weight parameter per function #7137 (issue: #6955)
-
Add time zone setting for relative date math in range filter/query #7113 (issue: #3729)
-
Add support for the `_name` parameter to the `simple_query_string` query #6979
-
Function score parser should throw exception if both `functions: []` and single `function` given #5995
-
Refactor SimpleQueryParser settings into separate Settings class, add "lenient" option #5208 (issue: #5011)
-
Throw parsing exception if terms filter or query has more than one field #5137 (issue: #5014)
-
Add "locale" parameter to query_string and simple_query_string #5131 (issue: #5128)
-
Add support for `lowercase_expanded_terms` flag to simple_query_string #5126 (issue: #5008)
-
Add fuzzy/slop support to `simple_query_string` #4985
-
Range filter no cache behaviour for `now` with rounding #4955 (issues: #4846, #4947)
-
Expose `dist`/`pre`/`post` options for SpanNotQuery #4452
-
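Several of the function_score entries above (the per-function `weight` parameter, `field_value_factor`, combining a query and a filter per function) surface in request bodies of this shape; a sketch with illustrative field names:

```json
GET /_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "elasticsearch" } },
      "functions": [
        { "filter": { "term": { "featured": true } }, "weight": 2 },
        { "field_value_factor": { "field": "popularity" } }
      ],
      "score_mode": "sum"
    }
  }
}
```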
- REST
-
-
Suppress rest exceptions by default and log them instead #12991
-
Create Snapshot: remove _create from POST path to match PUT #11928 (issue: #11897)
-
Add `rewrite` query parameter to the `indices.validate_query` API spec #11580 (issue: #10147)
-
Unify query_string parameters parsing #11057
-
HttpServer: Support relative plugin paths #10975 (issue: #10958)
-
Remove global `source` parameter from individual APIs in REST spec #10863
-
Add more utilities for source/body handling in RestAction #10724
-
Add option to only return simple exception messages #10117
-
Expose `master_timeout` flag on `GET _template` & `HEAD _template` #9688
-
Add support for multi-index query parameters for `_cluster/state` #9295 (issue: #5229)
-
Support JSON request body in scroll, clear scroll, and analyze APIs #9076 (issue: #5866)
-
Adds parameters to API endpoint cluster put settings specification #8769
-
Added `_shards` header to all write responses. #7994
-
Changed the root rest endpoint (/) to use cluster service #7933 (issue: #7899)
-
Add the cluster name to the "/" endpoint #7524
-
A content decompressor that throws a human readable message when #7241
-
Added missing percolate API parameters to the rest spec #7173
-
Add REST API spec for /_search_shards endpoint #5907
-
Rest layer refactoring phase 2 + recycling in http layer #5708
-
Add `explain` flag support to the reroute API #5027 (issues: #2483, #5169)
-
Throw exception if an additional field was placed inside the "query" body #4913 (issue: #4895)
-
REST API: Consistent get field mapping response #4822 (issue: #4738)
-
Rest API: Ensure 503 signals == retry on another node [ISSUE] #4066
-
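As an example of the request-body support added in #9076, scroll parameters can be sent as JSON instead of query-string parameters (the `scroll_id` below is a placeholder for the value returned by the initial search):

```json
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll id returned by the initial search>"
}
```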
- Recovery
-
-
Reduce cluster update reroutes with async fetch #11421
-
Check if the index can be opened and is not corrupted on state listing #11269 (issue: #11226)
-
No need to send mappings to the master node on phase 2. #11207
-
Allow to recover into a folder containing a corrupted shard #10558
-
Integrate translog recovery into Engine / InternalEngine #10452
-
Only cancel recovery when primary completed relocation #10218
-
Wipe shard state before switching recovered files live #10179 (issue: #10053)
-
Engine: update index buffer size during recovery and allow configuring version map size #10046 (issues: #6363, #6667)
-
Unify RecoveryState management to IndexShard and clean up semantics #9902 (issue: #9503)
-
Only iterate the files that we recovered from the commit #9761
-
Add a timeout to local mapping change check #9575
-
Node shut down during the last phase of recovery needlessly fails shard [ISSUE] #9496
-
Flush immediately after a remote recovery finishes (unless there are ongoing ones) #9439
-
Don’t throttle recovery indexing operations #9396 (issue: #9394)
-
Release store lock before blocking on mapping updates #9102
-
Ensure shards are identical after recovery #8723
-
Be more resilient to partial network partitions #8720
-
Throw IndexShardClosedException if shard is closed #8648
-
Allow to cancel recovery sources when shards are closed #8555
-
Refactor RecoveryTarget state management #8092 (issues: #7315, #7893)
-
During recovery, mark last file chunk to fail fast if payload is truncated #7830
-
Remove unneeded waits on recovery cancellation #7717
-
Set a default of 5m to `recover_after_time` when any of the `expected*Nodes` is set #6742
-
Add a best effort waiting for ongoing recoveries to cancel on close #6741
-
Cancel recovery if shard on the target node closes during recovery operation #6645
-
RecoveryID should not be a per JVM but per Node #6207
-
Before deleting a local unused shard copy, verify we’re connected to the node it’s supposed to be on #6191
-
Change default recovery throttling to 50MB / sec #5913
-
Fail replica shards locally upon failures #5847 (issue: #5800)
-
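The recovery throttling default changed in #5913 corresponds to the dynamically updatable `indices.recovery.max_bytes_per_sec` setting, which can be adjusted at runtime, e.g.:

```json
PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "50mb"
  }
}
```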
- Scripting
-
-
Allow scripts to expose whether they use the `_score`. #12695
-
Add path.scripts directory #12668
-
Simplify CacheKey used for scripts #12092
-
Allow executable expression scripts for aggregations #11689 (issue: #11596)
-
Unify script and template requests across codebase #11164 (issues: #10113, #10810, #11091)
-
Minor TimeZone Fix #10994
-
Run groovy scripts with no permissions #10969
-
Add Field Methods #10890
-
Allow plugins to define custom operations that they use scripts for #10419 (issue: #10347)
-
Add String to the default whitelisted receivers #9837 (issue: #8866)
-
Make `script.groovy.sandbox.method_blacklist_patch` truly append-only #9473
-
Make groovy sandbox method blacklist dynamically additive #9470
-
Add explicit error message when script_score script returns NaN #8750 (issue: #2426)
-
Use groovy-x.y.z-indy jar for better scripting performance #8183 (issue: #8182)
-
Add GroovyCollections to the sandbox whitelist #7250 (issues: #7088, #7089)
-
Make ScoreAccessor utility class publicly available for other script engines #6898 (issue: #6864)
-
Remove setNextScore in SearchScript. #6864
-
Add a transformer to translate constant BigDecimal to double #6609
-
Add Groovy sandboxing for GString-based method invocation #6596
-
Fix optional default script loading #6582
-
Exposed _uid, _id and _type fields as stored fields (_fields notation) #6406
-
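Many of the scripting entries above (score access, Groovy sandboxing, the indy Groovy jar) concern inline scripts of this shape; a sketch of a Groovy `script_score` in the 1.x syntax, with an illustrative field name:

```json
GET /_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "script_score": {
        "script": "_score * doc['popularity'].value"
      }
    }
  }
}
```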
- Search
-
-
Split SearchModule.configure() into separate methods #12827
-
Only compute scores when necessary with FiltersFunctionScoreQuery. #12707
-
Speed up the `function_score` query when scores are not needed. #12693
-
Add _replica and _replica_first as search preference. #12244 (issue: #12222)
-
Term Query: Be more strict during parsing #12195 (issue: #12184)
-
Clean up handling of missing values when merging shard results on the coordinating node. #12127 (issue: #9155)
-
Always return metadata in get/search APIs. #11816 (issue: #8068)
-
Search `preference` based on node specification #11464 (issue: #5925)
-
Do not specialize TermQuery vs. TermsQuery. #11308
-
Minor refactor of MultiValueMode removing apply and reduce #11290
-
Count api to become a shortcut to the search api #11198 (issues: #9110, #9117)
-
Make SCAN faster. #11180
-
Remove (dfs_)query_and_fetch from the REST API #10864 (issue: #9606)
-
Cut over to IndexSearcher.count. #10674
-
Single value numeric queries shouldn’t be handled by NumericRangeQuery #10648 (issue: #10646)
-
Query scoring change for single-value queries on numeric fields #10631 (issue: #10628)
-
Remove unused normsField from MatchAllQuery #10592
-
Replace deprecated filters with equivalent queries. #10531 (issue: #8960)
-
Avoid calling DocIdSets.toSafeBits. #9546
-
Parse terms filters on a single term as a term filter. #9014
-
Close active search contexts on SearchService#close() #8947 (issue: #8940)
-
Surgically removed slow scroll #8780
-
Filter cache: add a `_cache: auto` option and make it the default. #8573 (issue: #8449)
-
Do not force the post-filter to be loaded into a BitSet. #8488
-
Reduce memory usage during fetch source sub phase #8138
-
Don’t let `took` be negative. #7968
-
Use FixedBitSetFilterCache for delete-by-query #7581 (issue: #7037)
-
Speed up string sort with custom missing value #7005
-
Wrap filter only once in ApplyAcceptedDocsFilter #6873
-
Remove Queries#optimizeQuery - already handled in BooleanQuery #6743
-
Return missing (404) if a scroll_id is cleared that no longer exists. #5865 (issue: #5730)
-
Speed up `exists` and `missing` filters on high-cardinality fields [ISSUE] #5659
-
Freq terms enum #5597
-
Capture and set start time in Delete By Query operations #5540
-
Add dedicated /_search/template endpoint for query templates #5353
-
Add failures reason to delete by query response #5095 (issue: #5093)
-
Use patched version of ReferenceManager to prevent infinite loop in ReferenceManager#acquire() #5043
-
Improve scroll search by using IndexSearcher#searchAfter(…) #4968 (issue: #4940)
-
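The dedicated template endpoint added in #5353 accepts an inline template plus parameters, roughly as follows (field and parameter names are illustrative):

```json
GET /_search/template
{
  "template": {
    "query": { "match": { "{{field}}": "{{value}}" } }
  },
  "params": {
    "field": "title",
    "value": "elasticsearch"
  }
}
```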
- Settings
-
-
Throw Exception for missing settings file #12833 (issue: #11510)
-
Remove lenient store type parsing #12735
-
Add node setting to send SegmentInfos debug output to System.out #11546
-
ResourceWatcher: Rename settings to prevent watcher clash #11359 (issues: #11033, #11175)
-
Remove `cluster.routing.allocation.balance.primary` #9159
-
Add `http.publish_port` setting to the HTTP module #8807 (issue: #8137)
-
Don’t accept a dynamic update to `min_master_nodes` which is larger than the current master node count #8321
-
Validates bool values in yaml for node settings #8186 (issue: #8097)
-
Store index creation time in index metadata #7218 (issue: #7119)
-
Make `cluster.routing.allocation.allow_rebalance` a dynamic setting #7095 (issue: #7092)
-
Security: Allow to configure CORS allow-credentials header to work via SSL #7059 (issue: #6380)
-
Allow `index.merge.scheduler.max_thread_count` to be dynamically changed #6925 (issue: #6882)
-
Security: Support regular expressions for CORS allow-origin to match against #6923 (issues: #5601, #6891)
-
Added three frequency levels for resource watching #6896
-
Added more utility methods to Settings #6840
-
Improve Settings#get lookup for camel case support #6765
-
Security: Make JSONP responses optional. #6164
-
Allow to change concurrent merge scheduling setting dynamically #6098
-
Trimmed the main `elasticsearch.yml` configuration file #5861
-
Throw error when incorrect setting applied to `auto_expand_replicas` [ISSUE] #5752
-
Add `getAsRatio` to Settings class, allow DiskThresholdDecider to take percentages #5690
Corrected issue with throttle type setting not respected upon updates #5392
-
Made it possible to dynamically update the `discovery.zen.publish_timeout` cluster setting #5068 (issue: #5063)
-
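Settings made dynamic above, such as `cluster.routing.allocation.allow_rebalance` (#7095), can be changed at runtime via the cluster update settings API, e.g.:

```json
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.allow_rebalance": "indices_all_active"
  }
}
```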
- Shadow Replicas
- Snapshot/Restore
-
-
Add support for bulk delete operation in snapshot repository #12587 (issue: #12533)
-
Create a directory during repository verification #12323 (issue: #11611)
-
Add validation of snapshot FileInfo during parsing #12108
-
Add checksum to snapshot metadata files #12002 (issue: #11589)
-
Snapshot info should contain version of elasticsearch that created the snapshot #11985 (issue: #11980)
-
Extract all shard-level snapshot operation into dedicated SnapshotShardsService #11756
-
Add snapshot name validation logic to all snapshot operations #11617
-
Change metadata file format #11507
-
Add support for applying setting filters when displaying repository settings #11270 (issue: #11265)
-
Check that reading indices is allowed before creating their snapshots #11133
-
Don’t throw an exception if repositories are unregistered with `*` #11113
-
Improve the error message when attempting to snapshot a closed index #10608 (issue: #10579)
-
AbstractBlobContainer.deleteByPrefix() should not list all blobs #10366 (issue: #10344)
-
Batching of snapshot state updates #10295
-
Refactor how restore cleans up files after snapshot was restored #9770
-
Add ability to retrieve currently running snapshots #9400 (issues: #7859, #8782, #8887)
-
Add support for changing index settings during restore process #9285 (issue: #7887)
-
Override write(byte[] b, int off, int len) in FilterOutputStream for better performance #8749 (issue: #8748)
-
Allow custom metadata to specify whether or not it should be in a snapshot #7901 (issue: #7900)
-
Write Snapshots directly to the blobstore stream #7637
-
It should be possible to restore an index without restoring its aliases [ISSUE] #6457
-
Snapshot/Restore: Add ability to restore partial snapshots #6368 (issue: #5742)
-
Switch to shared thread pool for all snapshot repositories #6182 (issue: #6181)
-
Improve speed of running snapshot cancelation #5244 (issue: #5242)
-
Add ability to get snapshot status for running snapshots #5123 (issue: #4946)
-
Add throttling to snapshot and restore operations #4891 (issue: #4855)
-
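Combining two of the restore improvements above (partial restore, #6368, and index settings overrides, #9285), a restore request might look like the following; the repository, snapshot, and index names are illustrative:

```json
POST /_snapshot/my_backup/snapshot_1/_restore
{
  "indices": "logs-*",
  "partial": true,
  "index_settings": {
    "index.number_of_replicas": 0
  }
}
```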
- Stats
-
-
Refactor, remove _node/network and _node/stats/network. #12922 (issue: #12889)
-
Expose ClassloadingMXBean in Node Stats #12764 (issue: #12738)
-
Count scans in search stats and add metrics for scrolls #12069 (issue: #9109)
-
Field stats: added index_constraint option #11259 (issue: #11187)
-
Add CommitStats to supply information about the current commit point #10687
-
Add throttle stats to index stats and recovery stats #10097
-
Recovery: add total operations to the `_recovery` API #10042 (issue: #9368)
-
Add pending tasks count to cluster health #9877
-
Hot threads should include timestamp and params #9773
-
Added `verbose` option to segments api, with full ram tree as first additional element per segment #9111
-
Add `ignore_idle_threads` (default: true) to hot threads #8985 (issue: #8908)
-
Add more fine grained memory stats from Lucene segment reader to index stats #8832
-
Add time in index throttle to index stats. #7896 (issue: #7861)
-
Add `segments.index_writer_max_memory` to stats #7440 (issues: #6483, #7438)
-
Track the number of times the CircuitBreaker has been tripped #6134 (issue: #6130)
-
Disable RAM usage estimation on Lucene 3.x segments. #5202 (issue: #5201)
-
- Store
-
-
Fall back to reading SegmentInfos from Store if reading from commit fails #11403 (issue: #11361)
-
Consolidate directory lock obtain code #11390
-
Read segment info from latest commit whenever possible #11361
-
Schedule pending delete if index store delete fails #9856
-
Improve safety when deleting files from the store #9801
-
Use Directory#fileLength() less during calculating checksums #9689
-
Cache fileLength for fully written files #9683
-
Populate metadata.writtenBy for pre 1.3 index files. #9152
-
Expose ShardId via LeafReader rather than Directory API #8812
-
Synchronize operations that modify file mappings on DistributorDirectory #8408
-
Drop pre 0.90 compression BWC #8385
-
Use DistributorDirectory only if there is more than one data directory #8383
-
Cut over MetaDataStateFormat to NIO Path API #8297
-
Remove special file handling from DistributorDirectory #8276
-
Try to increment store before searcher is acquired #7792
-
Fold two hashFile implementations into one #7720
-
Before deleting shard verify that another node holds an active shard instance #6692
-
Make a hybrid directory default using `mmapfs`/`niofs` #6636
-
- Suggesters
- Term Vectors
-
-
Only load term statistics if required #11737
-
Requests are now timed #9583
-
Support for shard level caching of term vectors #8395
-
Add support for distributed frequencies #8144
-
Add support for realtime term vectors #7846
-
Support for custom analyzers in term vectors and MLT query #7801
-
Support for artificial documents #7530
-
Support for version and version_type #7480
-
Return found: false for docs requested between index and refresh #7124 (issue: #7121)
-
Adds support for wildcards in selected fields #7061
-
Compute term vectors on the fly if not stored in index #6567 (issue: #5184)
-
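The artificial-documents support in #7530 means term vectors can be computed for a document that is not stored in the index; a sketch (index, type, and field names are illustrative, and the endpoint was spelled `_termvector` in earlier releases):

```json
GET /products/offer/_termvectors
{
  "doc": { "description": "highly scalable search engine" },
  "term_statistics": true
}
```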
- Top Hits
- Translog
-
-
Make translog file name parsing strict #11875
-
Some smallish translog cleanups #11200
-
Add translog checkpoints to prevent translog corruption #11143 (issues: #10933, #11011)
-
Make modifying operations durable by default. #11011 (issue: #10933)
-
Use buffered translog type also when sync is set to 0 #10993
-
Remove useless random translog directory selection #10589
-
Don’t rename recovery translog in gateway #9719
-
Cut over to Path API #8611
-
Refactor the Translog.read(Location) method #7780
-
Write translog opSize twice #7735
-
Remove unused stream #7683
-
Clean up translog interface #7564
-
Set default translog `flush_threshold_ops` to unlimited, to flush by byte size by default and not penalize tiny documents #6783 (issue: #6443)
-
Use unlimited `flush_threshold_ops` for translog (again) [ISSUE] #6726
-
Raise proper failure if not fully reading translog entry #6562
-
Use unlimited `flush_threshold_ops` for translog #5900
-
Fix visibility in buffered translog #5609
-
Use BytesReference to write to translog files #5463
-
Don’t throttle the translog stage of recovery [ISSUE] #4890
-
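#11011 above makes modifying operations durable (synced to the translog) per request by default; the earlier buffered behaviour remains available per index. A sketch using the 2.x-era setting names (the index name is illustrative):

```json
PUT /logs/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s"
}
```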
- Tribe Node
-
-
Index level blocks, index conflict settings #5501
-
- Upgrade API
Bug fixes
- Aggregations
-
-
Aggregation: Fix AggregationPath.subPath() to not throw ArrayStoreException #13035
-
Throw error if cardinality aggregator has sub aggregations #12989 (issue: #12988)
-
Full path validation for pipeline aggregations #12595 (issue: #12360)
-
Upgrade HDRHistogram to version 2.1.6. #12554
-
Fixes serialization of HDRHistogram in percentiles aggregations #12505
-
Fix cidr mask conversion issue for 0.0.0.0/0 and add tests #12005 #12430 (issue: #12005)
-
Adds new script API to ValuesSourceMetricsAggregationBuilder #12152
-
Aggregations: Makes SKIP Gap Policy work correctly for Bucket Script aggregation #11970
-
`moving_avg` forecasts should not include current point #11641
-
Allow aggregations_binary to build and parse #11473 (issue: #11457)
-
Fix bug where moving_avg prediction keys are appended to previous prediction #11465 (issue: #11454)
-
Sibling Pipeline Aggregations can now be nested in SingleBucketAggregations #11380 (issue: #11379)
-
Fixed Moving Average prediction to calculate the correct keys #11375 (issue: #11369)
-
Queries with `size: 0` break aggregations that need scores #11358 (issue: #11119)
-
Fix geo bounds aggregation when longitude is 0 #11090 (issue: #11085)
-
Fixes Infinite values return from geo_bounds with non-zero bucket-ordinals #10917 (issue: #10804)
-
Sampler agg could not be used with Terms agg’s order. #10785 (issue: #10719)
-
Fix `_as_string` output to only show when format specified #10571 (issue: #10284)
-
Fix multi-level breadth-first aggregations. #10411 (issues: #9544, #9823)
-
Be lenient when converting local to utc time in time zone roundings #10031 (issue: #10025)
-
Prevent negative intervals in date_histogram #9690 (issue: #9634)
-
Make the nested aggregation call sub aggregators with doc IDs in order #9548 (issue: #9547)
-
Remove limitation on field access within aggs to the types provided in the search #9487
-
Validate the aggregation order on unmapped terms in terms agg. #8952 (issue: #8946)
-
Fix date_histogram issues during a timezone DST switch #8655 (issue: #8339)
-
Fix geohash grid doc counts computation on multi-valued fields #8574 (issue: #8512)
-
Parser throws NullPointerException when Filter aggregation clause is empty #8527 (issue: #8438)
-
Fixes scripted metrics aggregation when used as a sub aggregation #8037 (issue: #8036)
-
Makes script params consistent with other APIs in scripted_metric #7969
-
Significant terms can throw error on index with deleted docs. #7960 (issue: #7951)
-
Fixes resize bug in Geo bounds Aggregator #7565 (issue: #7556)
-
The nested aggregator should iterate over the child doc ids in ascending order. #7514 (issue: #7505)
-
Fixes pre and post offset serialisation for histogram aggs #7313 (issue: #7312)
-
`key_as_string` only shown when format specified in terms agg #7160 (issue: #7125)
-
Fixed value count so it can be used in terms order #7051 (issue: #7050)
-
Fix infinite loop in the histogram reduce logic. #7022 (issue: #6965)
-
More lenient type parsing in histo/cardinality aggs #6948 (issue: #6893)
-
Fix JSON response for significant terms #6535
-
Fix cardinality aggregation when doc values field is empty #6413
-
Fixed conversion of date field values when using multiple date formats #6266 (issue: #6239)
-
Fail queries that have two aggregations with the same name. #6258 (issue: #6255)
-
Fix DateHistogramBuilder to use a String pre_offset and post_offset #5587 (issue: #5586)
-
DateHistogram.Bucket should return the date key in UTC [ISSUE] #5477
-
Fix cardinality memory-usage considerations. #5452
-
Allow scripts to return more than 4 values in aggregations. #5416 (issue: #5414)
-
Invoke postCollection on aggregation collectors #5387
-
Fixed a bug in date_histogram aggregation parsing #5379 (issue: #5375)
-
Fix NPE/AIOOBE when building a bucket which has not been collected. #5250 (issue: #5048)
-
Changed the caching of FieldDataSource in aggs to be based on field name… #5205 (issue: #5190)
-
date_histogram against empty index results in ArrayIndexOutOfBoundsException [ISSUE] #5179
-
Fix BytesRef owning issue in string terms aggregations. #5039 (issue: #5021)
-
Fix hashCode values of aggregations' BytesValues. #5006 (issue: #5004)
-
Sorting terms agg by sub-aggregation doesn’t respect asc/desc when executing on a single shard [ISSUE] #4951
-
Fixed an issue where there are sub aggregations executing on a single shard #4869 (issue: #4843)
-
- Aliases
- Allocation
-
-
Fix messaging about delayed allocation #12515 (issue: #12456)
-
ThrottlingAllocationDecider should not count relocating shards #12409
-
Shard Started messages should be matched using an exact match #11999
-
Reroute after node join is processed #11960 (issues: #11776, #11923)
-
GatewayAllocator: reset rerouting flag after error #11519 (issue: #11421)
-
Fix handling of `dangling_timeout` set to 0 and `auto_import_dangled` true #8257
-
Enable ClusterInfoService by default #8206
-
Improve handling of failed primary replica handling #6816 (issue: #6808)
-
Do not ignore ConnectTransportException for shard replication operations #6813
-
Failed shards could be re-assigned to the same nodes if multiple replicas failed at once #5725
-
BalancedShardAllocator makes non-deterministic rebalance decisions [ISSUE] #4867
-
- Analysis
-
-
Custom analyzer names and aliases must not start with _ #11303 (issue: #9596)
-
Fix tokenizer settings in SynonymTokenFilterFactory #10489
-
CharArraySet doesn’t know how to lookup the original string in an ImmutableList. #6238 (issue: #6237)
-
Analyze API: Default analyzer accidentally removed stopwords #6043 (issue: #5974)
-
- Bulk
-
-
Fix: Use correct OpType on Failure in BulkItemResponse #12060 (issue: #9821)
-
Allow null values in the bulk action/metadata line parameters #11459 (issue: #11458)
-
Throw exception if unrecognized parameter in bulk action/metadata line #11331 (issue: #10977)
-
`_default_` mapping should be picked up from index template during auto create index from bulk API #10762 (issue: #10609)
-
Removed duplicate timeout param #10205
-
Handle failed request when auto create index is disabled #8163 (issue: #8125)
-
Bulk operation can create duplicates on primary relocation #7729
-
Cluster block with auto create index bulk action can cause bulk execution to not return #7109 (issue: #7086)
-
Do not fail whole request on closed index #6790 (issue: #6410)
-
Fix return of wrong request type on failed updates #6646 (issue: #6630)
-
Bulk request which try and fail to create multiple indices may never return #6436
-
Fix mapping creation on bulk request #5623
-
Ensure that index specific failures do not affect whole request #4995 (issue: #4987)
-
Failed preparsing does not fail whole bulk request #4781 (issue: #4745)
-
- CAT API
-
-
_cat/nodes: Thread null handling through stats and info #9938 (issue: #6297)
-
Fix NullPointerException in cat-recovery API #6190
-
_cat/allocation returns -1 as disk.total for client nodes [ISSUE] #5948
-
ElasticsearchIllegalStateException when invoking _cat plugins [ISSUE] #5715
-
Node version sometimes empty in _cat/nodes [ISSUE] #5480
-
- CRUD
-
-
`detect_noop` now understands `null` as a valid value #11210 (issue: #11208)
-
The parent option on update request should be used for upsert only. #9612 (issue: #4538)
-
Don’t throw DAEE on replica for create operation; use IW.updateDocument/s instead #7146 (issue: #7142)
-
MultiGet: Fail when using no routing on an alias to an index that requires routing #7145
-
Add parameter to GET API for checking if generated fields can be retrieved #6973 (issue: #6676)
-
DocumentMissingException is also thrown on update retries #6724 (issue: #6355)
-
- Cache
-
-
Store filter cache statistics at the shard level instead of index. #11886
-
Don’t use bitset cache for children filters. #10663 (issues: #10629, #10662)
-
Remove query-cache serialization optimization. #9500 (issue: #9294)
-
Queries are never cached when date math expressions are used (including exact dates) #9269 (issue: #9225)
-
Change default eager loading behaviour for nested fields and parent/child in bitset cache #8440
-
Don’t eagerly load NestedDocsFilter in bitset filter cache, because it is never used. #8414 (issue: #8394)
-
Fixed bitset filter cache leftover in nested filter #8303
-
Cleanup non nested filter to not flip the FixedBitSet returned by the wrapped filter. #8232 (issue: #8227)
-
Move the child filter over to the fixed bitset cache. #8171
-
- Circuit Breakers
-
-
Make "noop" request breaker a non-dynamic setting #8179
-
Only set `breaker` when stats are retrieved #7721
-
Percolator doesn’t reduce CircuitBreaker stats in every case. #5588
-
Fix possible discrepancy in circuit breaker in parent/child #5526
-
Fix issue where circuit breaker was always reset to 80% upon startup #5334
-
NullPointerException in RamAccountingTermsEnum [ISSUE] #5326
-
- Cluster
-
-
Changes in unassigned info and version might not be transferred as part of cluster state diffs #12387
-
Rename cluster state uuid to updateId #11862 (issue: #11831)
-
ClusterHealth shouldn’t fail with "unexpected failure" if master steps down while waiting for events #11493
-
Write state also on data nodes if not master eligible #9952 (issue: #8823)
-
GatewayService should register cluster state listener before checking for current state #8789
-
Extend refresh-mapping logic to the `_default_` type #8413 (issue: #4760)
-
ClusterHealthAPI does not respect waitForEvents when local flag is set #7731
-
Use node’s cluster name as a default for an incoming cluster state who misses it #7414 (issue: #7386)
-
Use the provided cluster state instead of fetching a new cluster state from cluster service. #7013
-
During recovery, only send mapping updates to master if needed #6772 (issue: #6762)
-
Check for index blocks against concrete indices on master operations [ISSUE] #6694
-
Also send Refresh and Flush actions to relocation targets #6545
-
Do not execute cluster state changes if current node is no longer master #6230
-
TransportMasterNodeOperationAction: tighter check for postAdded cluster state change #5548 (issue: #5499)
-
- Core
-
-
ThreadPools: schedule a timeout check after adding command to queue #12319
-
Throw LockObtainFailedException exception when we can’t lock index directory #12203
-
Only clear open search ctx if the index is delete or closed via API #12199 (issue: #12116)
-
Close lock even if we fail to obtain #11412
-
Balance new shard allocations more evenly on multiple path.data #11185 (issue: #11122)
-
Use System.nanoTime for elapsed time #11058
-
Increase default rate limiting for snapshot, restore and recovery to 40 MB/sec #10185 (issue: #6018)
-
Also throttle delete by query when merges fall behind #9986
-
Promptly cleanup updateTask timeout handler #9621
-
Verify the index state of concrete indices after alias resolution #9057
-
`ignore_unavailable` shouldn’t ignore closed indices #9047 (issue: #7153)
-
Hard wire utf-8 encoding, so unicode filenames work #8847
-
Remove unnecessary index removal on index creation #8639
-
Remove _state directory if index has been deleted #8610
-
Return 0 instead of -1 for unknown/non-exposed ramBytesUsed() #8291 (issue: #8239)
-
Don’t catch FNF/NSF exception when reading metadata #8207
-
Don’t handle FNF exceptions when reading snapshot #8086
-
Force refresh when versionMap is using too much RAM #6443 (issue: #6378)
-
Use XNativeFSLockFactory instead of the buggy Lucene 4.8.1 version [ISSUE] #6424
-
Don’t report terms as live if all its docs are filtered out #6221 (issue: #6211)
-
Ensure pending merges are updated on segment flushes #5780 (issue: #5779)
- Dates
- Discovery
-
-
Make sure NodeJoinController.ElectionCallback is always called from the update cluster state thread #12372
-
ZenDiscovery: #11960 failed to remove eager reroute from node join #12019
-
Node receiving a cluster state with a wrong master node should reject and throw an error #9963
-
Check index uuid when merging incoming cluster state into the local one #9541 (issue: #9489)
-
Only retry join when other node is not (yet) a master #8972
-
Removed unnecessary DiscoveryService reference from LocalDiscovery #8415 (issue: #8539)
-
Improve the lifecycle management of the join control thread in zen discovery. #8327
-
UnicastZenPing - use temporary node ids if can’t resolve node by its address #7719
-
Transport client: Don’t add listed nodes to connected nodes list in sniff mode #7067 (issues: #6811, #6829, #6894)
-
Handle ConnectionTransportException during a Master/Node fault detection ping during Discovery #6686
-
- Engine
-
-
Fix NPE when streaming commit stats #11266
-
Sync translog before closing engine #10484
-
Fixes InternalIndexShard callback handling of failure #8644 (issue: #5945)
-
Add current translog ID to commit meta before closing #8245
-
The `index.fail_on_corruption` setting is not updateable #6941
-
Closing an IndexReader on an already relocated / closed shard can cause memory leaks [ISSUE] #5825
-
- Exceptions
-
-
Use StartupError to format all exceptions hitting the console #13041 (issue: #13040)
-
Improve console logging on startup exception #12976
-
Don’t special-case on ElasticsearchWrapperException in toXContent #12015 (issue: #11994)
-
Implement toXContent on ShardOperationFailureException #11155 (issue: #11017)
-
Remove double exception handling that causes false replica failures #10990
-
Fixing copy/paste mistake in SearchRequest.extraSource’s exception message #8118 (issue: #8117)
-
Turn unexpected exceptions when reading segments into CorruptedIndexException #7693
-
Throw better error if invalid scroll id is used #5738 (issue: #5730)
-
- Fielddata
- Geo
-
-
Correct ShapeBuilder coordinate parser to ignore values in 3rd+ dimension #10539 (issue: #10510)
-
Fix hole intersection at tangential coordinate #10332 (issue: #9511)
-
Fix validate_* merge policy for GeoPointFieldMapper #10165 (issue: #10164)
-
Correct bounding box logic for GeometryCollection type #9550 (issue: #9360)
-
Throw helpful exception for Polygons with holes outside of shell #9105 (issue: #9071)
-
GIS envelope validation #9091 (issues: #2544, #8672, #9067, #9079, #9080)
-
Fix for NPE enclosed in SearchParseException for a "geo_shape" filter or query #8785 (issue: #8432)
-
Fix for geohash neighbors when geohash length is even. #8529 (issue: #8526)
-
Fix geohash grid aggregation on multi-valued fields. #8513 (issue: #8507)
-
Fix for ArithmeticException[/ by zero] when parsing a polygon #8475 (issue: #8433)
-
Remove unnecessary code from geo distance builder #8338
-
Fix IndexedGeoBoundingBoxFilter to not modify the bits of other filters. #8325
-
Improved error handling in geo_distance #7272 (issue: #7260)
-
Fixes computation of geohash neighbours #7247 (issue: #7226)
-
Fix geo_shapes which intersect dateline #7188 (issue: #7016)
-
optimize_bboxfor geo_distance filters can cause missing results [ISSUE] #6008
-
- Highlighting
-
-
Plain highlighter to use analyzer defined on a document level #6267 (issue: #5497)
-
Implement BlendedTermQuery#extractTerms to support highlighting. #5247 (issue: #5246)
-
Made SearchContextHighlight.Field class immutable to prevent from unwanted updates #5223 (issue: #5175)
-
Highlighting on a wildcard field name causes the wildcard expression to be returned rather than the actual field name [ISSUE] #5221
-
Fixed multi term queries support in postings highlighter for non top-level queries #5143 (issues: #4052, #5127)
-
- Index APIs
-
-
Remove expansion of empty index arguments in RoutingTable #10148 (issue: #9081)
-
Fix to make GET Index API consistent with docs #9178 (issue: #9148)
-
Fix GET index API always running all features #8392
-
Fix optimize behavior with force and flush flags. #7920 (issues: #7886, #7904)
-
Fixed validate query parsing issues #6114 (issues: #6111, #6112, #6116)
-
- Indexed Scripts/Templates
-
-
ScriptService can deadlock entire nodes if script index is recovering #8901
-
GetIndexedScript call can deadlock #8266
-
Make template params take arrays #8255
-
Cleaned up various issues #7787 (issues: #7559, #7560, #7567, #7568, #7647)
-
Change the default auto_expand for the .scripts index to 0-all #7502
-
Fix .script index template. #7500
- Inner Hits
-
-
Reset the ShardTargetType after serializing inner hits. #12261
-
Properly support named queries for both nested and parent child inner hits #11880 (issues: #10661, #10694)
-
Fix multi level parent/child bug #11199
-
Make sure size=0 works on the inner_hits level. #10388 (issue: #10358)
-
Make sure inner hits also works for nested fields defined in object field #10353 (issue: #10334)
-
Fix bug where parse error is thrown if a inner filter is used in a nested filter/query. #10309 (issue: #10308)
-
Don’t fail if an object is specified as a nested value instead of an array. #9743 (issue: #9723)
-
Make sure inner hits defined on has_parent query resolve hits properly #9384
-
- Internal
-
-
Add plugin modules before (almost all) others #13061 (issue: #12783)
-
Workaround JDK bug 8034057 #12970
-
Fix concurrency issue in PrioritizedEsThreadPoolExecutor. #12599
-
Fix ShardUtils#getElasticsearchDirectoryReader() #12594
-
don’t represent site plugins with null anymore #12577
-
IndicesStore shouldn’t try to delete index after deleting a shard #12494 (issue: #12487)
-
ShardUtils#getElasticsearchLeafReader() should use FilterLeafReader#getDelegate() instead of FilterLeafReader#unwrap #12437
-
Fix serialization of IndexFormatTooNewException and IndexFormatTooOldException #12277
-
Decode URL.getPath before resolving a real file #11940
-
Add a null-check for XContentBuilder#field for BigDecimals #11790 (issue: #11699)
-
AsyncShardFetch can hang if there are new nodes in cluster state #11615
-
Make JNA optional for tests and move classes to bootstrap package #11378 (issue: #11360)
-
Transport: remove support for reading/writing list of strings, use arrays instead #11276 (issue: #11056)
-
Fix CompressedString.equals. #11233
-
ThreadPool: make sure no leaking threads are left behind in case of initialization failure #11061 (issue: #9107)
-
Propagate headers & contexts to sub-requests #11060 (issue: #10979)
-
Fix NPE in PendingDelete#toString #11032
-
Ensure that explanation descriptions are not null on serialization #10689 (issue: #10399)
-
Fix possible NPE in InternalClusterService$NotifyTimeout, the future field is set from a different thread #10630 (issue: #3)
-
Allow ActionListener to be called on the network thread #10573 (issue: #10402)
-
Add missing hashCode method to RecoveryState#File #10501
-
Don’t try to send a mapping refresh if there is no master #10311 (issue: #10283)
-
Fix PageCacheRecycler’s max page size computation. #10087 (issue: #10077)
-
Close all resources if doStart fails #9898
-
Snapshot status api: make sure headers are handed over to inner nodes request #9409
-
Fix equality check of timevalue after serialization #9218
-
AdapterActionFuture should not set currentThread().interrupt() #9141 (issue: #9001)
-
PlainTransportFuture should not set currentThread().interrupt() #9001
-
IndexService - synchronize close to prevent race condition with shard creation #8557
-
When corruption strikes, don’t create exceptions with circular references #8331
-
In fixed bitset service fix order where the warmer listener is added. #8168 (issue: #8140)
-
Only schedule another refresh if refresh_interval is positive #8087 (issue: #8085)
-
Fix serialization of PendingClusterTask.timeInQueue. #8077
-
Dangling indices import ignores aliases #8059
-
Make close() synchronized during node shutdown #7885
-
Ensure GroupShardsIterator is consistent across requests, to ensure consistent sorting #7698
-
Fix ordering of Regex.simpleMatch() parameters #7661 (issue: #7651)
-
Use SEARCH threadpool for potentially blocking operations #7624 (issue: #7623)
-
Make network interface iteration order consistent #7494
-
Add all unsafe variants of LZF compress library functions to forbidden APIs. #7468 (issue: #8078)
-
Wait until engine is started up when acquiring searcher #7456 (issue: #7455)
-
Made original indices optional for broadcast delete and delete by query shard requests #7406
-
Force optimize was not passed to shard request #7405 (issue: #7404)
-
Fixed a request headers bug in transport client #7302
-
Fix explanation streaming #7257
-
VerboseProgress(PrintWriter) does not set the writer #7254
-
Fix a very rare case of corruption in compression used for internal cluster communication. #7210
-
Fix BytesStreamInput(BytesReference) ctor with nonzero offset #7197
-
Fix serialization bug in reroute API #7135
-
Support parsing lucene minor version strings #7055
-
Fix connect concurrency, can cause connection nodes to close #6964
-
When serializing HttpInfo, return null info if service is not started #6906
-
Remove indicesLifecycle.Listener from IndexingMemoryController #6892
-
Fix possible NPE during shutdown for requests using timeouts #6849
-
Prevent NPE if engine is closed while version map is checked #6786
-
If the node initialisation fails, make sure the node environment is closed correctly #6715
-
IndexingMemoryController should only update buffer settings of fully recovered shards #6667 (issue: #6642)
-
Fix possible race condition in checksum name generator #6662
-
Allow to serialize negative thread pool sizes #6486 (issues: #5357, #6325)
-
The ignore_unavailable option should also ignore indices that are closed #6475 (issue: #6471)
-
Guava doesn’t explicitly remove entries when clearing the entire cache [ISSUE] #6296
-
MetaData#concreteIndices to throw exception with a single index argument if allowNoIndices == false #6137
-
Restore read/write visibility in PlainShardsIterator. #6039 (issue: #5561)
-
Use TransportBulkAction for internal request from IndicesTTLService #5795 (issue: #5766)
-
Take stream position into account when calculating remaining length #5677 (issue: #5667)
-
Fix some warnings reported by Findbugs. #5571
-
Assorted fixes for bugs in the PagedBytesReference tests #5549
-
Fix for zero-sized content throwing off toChannelBuffer(). #5543
-
Count latch down if sendsPing throws exception #5440
-
Fix yamlBuilder() to return YAML builder instead of SMILE #5186 (issue: #5185)
-
- Java API
-
-
Add missing support for escape to QueryStringQueryBuilder #13016
-
Fix PrefixQueryBuilder to support an Object value #12124 (issue: #12032)
-
Properly fix the default regex flag to ALL for RegexpQueryParser and Builder #12067 (issue: #11896)
-
Java api: add missing support for boost to GeoShapeQueryBuilder and TermsQueryBuilder #11810 (issue: #11744)
-
Add missing rewrite parameter to FuzzyQueryBuilder #11139 (issue: #11130)
-
Ensure netty I/O thread is not blocked in TransportClient #10644
-
toString for SearchRequestBuilder and CountRequestBuilder #9944 (issues: #5555, #5576)
-
Added missing module registration in TransportClient for Significant Terms #7852 (issue: #7840)
-
Get indexed script shouldn’t allow setting the index [ISSUE] #7553
-
Add back string op type to IndexRequest #7387
-
Fixed the node retry mechanism which could fail without trying all the connected nodes #6829
-
Fix source excludes setting if no includes were provided [ISSUE] #6632
-
BulkRequest#add(Iterable) to support UpdateRequests #6551
-
Make sure afterBulk is always called in BulkProcessor #6495 (issue: #5038)
-
JsonXContentGenerator#writeRawField produces invalid JSON if raw field is the first field in the json object [ISSUE] #5514
-
Fix returning incorrect XContentParser #5510
-
Enforce query instance checking before it is wrapped as a filter #5431
-
forceSource highlighting field option doesn’t have any effect when set using the Java API [ISSUE] #5220
-
BulkProcessor processes every n+1 docs instead of n #4265
-
- Logging
-
-
Use task’s class name if not a TimedPrioritizeRunnable #11610
-
Fix potential NPE in new tracer log if request timeout #9994 (issue: #9286)
-
Fix example in logging daily rotate configuration #8550 (issue: #8464)
-
Fixes Logger class for BackgroundIndexer #6781
-
Fix format string for DiskThresholdDecider reroute explanation #5749
-
- Mapping
-
-
Move the murmur3 field to a plugin and fix defaults. #12931 (issue: #12874)
-
Fix field type compatibility check to work when only one previous type exists #12779 (issue: #12753)
-
numeric_resolution should only apply to dates provided as numbers. #11002 (issue: #10995)
-
Wait for mappings to be available on the primary before indexing. #10949
-
Fix _field_names to not have doc values #10893 (issue: #10892)
-
Explicitly disallow multi fields from using object or nested fields #10745
-
Unnecessary mapping refreshes caused by unordered fielddata settings #10370 (issue: #10318)
-
Fixed an equality check in StringFieldMapper. #10359 (issue: #10357)
-
Fix doc values representation to always serialize if explicitly set #10302 (issue: #10297)
-
Fix _field_names to be disabled on pre 1.3.0 indexes #10268 (issue: #9893)
-
Fix ignore_malformed behaviour for ip fields #10112
-
Update dynamic fields in mapping on master even if parsing fails for the rest of the doc #9874 (issue: #9851)
-
Throw StrictDynamicMappingException if dynamic: strict and undeclared field value is null #9445 (issue: #9444)
-
Using default: null for _timestamp field creates an index loss on restart #9233 (issues: #9104, #9223)
-
Reencode transformed result with same xcontent #8974 (issue: #8959)
-
Serialize doc values settings for _timestamp #8967 (issue: #8893)
-
Update cluster state with type mapping also for failed indexing request #8692 (issue: #8650)
-
Fix conflict when updating mapping with _all disabled #8426 (issues: #7377, #8423)
-
Generate dynamic mappings for empty strings. #8329 (issue: #8198)
-
Throw exception if null_value is set to null #7978 (issue: #7273)
-
Posting a mapping with default analyzer fails #7902 (issue: #2716)
-
Add explicit error when PUT mapping API is given an empty request body. #7618 (issue: #7536)
-
Enable merging of properties in the _timestamp field #7614 (issues: #5772, #6958, #777)
-
Fix index setting in _boost field #7557
-
Keep parameters in mapping for _timestamp and _size even if disabled #7475
-
Report conflict when trying to disable _ttl #7316 (issues: #7293, #777)
-
Make sure that multi fields are serialized in a consistent order. #7220 (issue: #7215)
-
Fix dynamic mapping of geo_point fields #7175 (issue: #6939)
-
Fix copy_to behavior on nested documents. #7079 (issue: #6701)
-
Add multi_field support for Mapper externalValue (plugins) #6867 (issue: #5402)
-
Fix possibility of losing meta configuration on field mapping update #6550 (issue: #5053)
-
Allow _version to use disk as a doc values format. #6523
-
Path-based routing doesn’t work with doc values [ISSUE] #5844
-
geo_point doesn’t allow null values as of 1.1.0 [ISSUE] #5680
-
Check "store" parameter for binary mapper and check "index_name" for all mappers #5585 (issue: #5474)
-
merge GeoPoint specific mapping properties #5506 (issue: #5505)
-
Merge null_value for boolean field and remove include_in_all for boolean field in doc #5503 (issue: #5502)
-
Geo Point Fieldmapper: Allow distance for geohash precision #5449 (issue: #5448)
-
Make sure get field mapping request is executed on node hosting the index #5225 (issue: #5177)
-
Added fields support to geo_point and completion field type #4963
-
- More Like This
- NOT CLASSIFIED
-
-
Rivers might not get started due to missing _meta document [ISSUE] #4864
-
- Nested Docs
-
-
Nested agg needs to reset root doc between segments. #9441 (issues: #9436, #9437)
-
Fix handling of multiple buckets being emitted for the same parent doc id in nested aggregation #9346 (issues: #8454, #9317)
-
In reverse nested aggregation, fix handling of the same child doc id being processed multiple times. #9345 (issues: #9263, #9346)
-
The parent filter of the nested aggregator isn’t resolved correctly all the time #9335 (issue: #9280)
-
Change nested agg to execute in doc id order #8454
-
If the _type field isn’t indexed, nested inner docs must be filtered out. #7410
-
The nested aggregator should also resolve and use the parentFilter of the closest reverse_nested aggregator. #7048 (issue: #6994)
-
Allow sorting on nested sub generated field #6151 (issue: #6150)
-
A nested nested aggregation falls outside of its parent nested aggregation bounds #5728
-
- Network
-
-
Deduplicate addresses from resolver. #12995
-
Remove usage of InetAddress#getLocalHost #12959
-
Fix network binding for ipv4/ipv6 #12942 (issues: #12906, #12915)
-
Transport: Do not make the buffer skip while a stream is open. #11988
-
Transport: fix racing condition in timeout handling #10220 (issue: #10187)
-
Fix NPE when initializing an accepted socket in NettyTransport. #6144
-
- Packaging
-
-
Fix variable substitution for OSes using systemd #12909
-
Fix rpm -e removing /etc/elasticsearch #12785
-
Makes sure all POMs contain a description #12771 (issue: #12550)
-
use spaces liberally in integration tests and fix space handling #12710 (issue: #12709)
-
Fix shaded jar packaging #12589
-
Fix Bootstrap to not call System.exit #12586
-
elasticsearch (again) adds CWD to classpath [ISSUE] #12580
-
Don’t add CWD to classpath when ES_CLASSPATH isn’t set. #12001 (issue: #12000)
-
Fix endless looping if starting fails #11836
-
Fix missing dependencies for RPM/DEB packages #11664 (issue: #11522)
-
Update project.name in bin/elasticsearch script #11348
-
Add antlr and asm dependencies #9696
-
Added quotes to allow spaces in installation path #8428 (issue: #8441)
-
Move forbidden api signature files to dev-tools. #7921 (issue: #7917)
-
Parsing command line args multiple times throws AlreadySelectedException #7282
-
Shade mustache into org.elasticsearch.common package #6193 (issue: #6192)
-
Export JAVA_HOME in RPM init script #5434
-
Set permission in debian postinst script correctly #5158 (issue: #3820)
-
RPMs: Add timeout to shutdown with KILL signal #4721 (issue: #5020)
-
- Parent/Child
-
-
Explicitly disabled the query cache #12955
-
Fix _parent.type validation #11436
-
Fix 2 bugs in children agg #10263 (issues: #10158, #9544, #9958)
-
Post-collection, the children agg should also invoke that phase on its wrapped child aggs. #9291 (issue: #9271)
-
Fix concurrency issues of the _parent field data. #9030 (issue: #8396)
-
Fixed p/c filters not being able to be used in alias filters. #8649 (issues: #5916, #8628)
-
Missing parent routing causes NullPointerException in Bulk API #8506 (issue: #8365)
-
The children agg didn’t take deleted documents into account #8180
-
Check if there is a search context, otherwise throw a query parse exception. #8177 (issue: #8031)
-
has_parent filter must take parent filter into account when executing the inner query/filter #8020 (issue: #7362)
-
A has_child or other p/c query wrapped in a query filter may emit wrong results #7685
-
Add support for the field data loading option to the _parent field. #7402 (issue: #7394)
-
If _parent field points to a non existing parent type, then skip the has_parent query/filter #7362 (issue: #7349)
-
Disabled parent/child queries in the delete by query api. #5916 (issue: #5828)
-
Parse has_child query/filter after child type has been parsed #5838 (issue: #5783)
-
Parent / child queries should work with non-default similarities #4979 (issue: #4977)
-
- Percolator
-
-
Support filtering percolator queries by date using now #12215 (issue: #12185)
-
Fix NPE when percolating a document that has a _parent field configured in its mapping #12214 (issue: #12192)
-
Load percolator queries before shard is marked POST_RECOVERY #11799 (issue: #10722)
-
Fail nicely if nested query with inner_hits is used in a percolator query #11793 (issue: #11672)
-
Prevent PercolateResponse from serializing negative VLong #11138
-
Fix wrong use of currentFieldName outside of a parsing loop #10307
-
Support encoded body as query string param consistently #9628
-
Fixed bug when using multi percolate api with routing #9161 (issue: #6214)
-
Pass down the types from the delete mapping request to the delete by query request #7091 (issue: #7087)
-
Fix memory leak when percolating with nested documents #6578
-
Percolator: Fix assertion in percolation with nested docs #6263
-
Add num_of_shards statistic to percolate context #6123 (issue: #6037)
-
The percolator needs to take deleted percolator documents into account. #5843 (issue: #5840)
-
Propagate percolate mapping changes to cluster state #5776
-
Fix highlighting in percolate existing doc api #5108
-
Make highlight query also work in the percolate api #5090
-
Percolator response always returns the matches key. #4882 (issue: #4881)
-
- Plugin Cloud Azure
- Plugin Delete By Query
-
-
Fix number of deleted/missing documents in Delete-By-Query #11745
-
- Plugins
-
-
Fix automatically generated URLs for official plugins in PluginManager #12885
-
strip elasticsearch- and es- from any plugin name #12160 (issues: #12143, #12158)
-
remove elasticsearch- from name of official plugins #12158 (issues: #11805, #12143)
-
Fix pluginmanager permissions for bin/ scripts #12157 (issue: #12142)
-
Only load a plugin once from the classpath #11301
-
Let HTTPS work correctly #10983
-
HTTP: Ensure url path expansion only works inside of plugins #10815
-
Installation failed when directories are on different file systems #9011 (issue: #8999)
-
Plugins failed to load since #8666 #8756
-
Support usage of ES_JAVA_OPTS in plugin commands #8288
-
Fix config path extraction from plugin handle #7935
-
Plugins with only bin and config do not install correctly #7154 (issue: #7152)
-
bin/plugin removes itself [ISSUE] #6745
-
Removing plugin does not fail when plugin dir is read only #6735 (issue: #6546)
-
Fix github download link when using specific version #6321
-
Properly quote $JAVA in bin/plugin #5765
-
NPE in PluginsService when starting elasticsearch with a wrong user #5196 (issues: #4186, #5195)
-
Upgrading analysis plugins fails #5034 (issues: #4936, #5030)
-
- Query DSL
-
-
Add support for disable_coord param to terms query #12756 (issue: #12755)
-
Do not track named queries that are null #12691 (issue: #12683)
-
multi_match query applies boosts too many times. #12294
-
QueryString ignores maxDeterminizedStates when creating a WildcardQuery #12269 (issue: #12266)
-
Rewrite set twice in WildcardQueryParser [ISSUE] #12207
-
Fix RegexpQueryBuilder#maxDeterminizedStates #12083 (issue: #11896)
-
CommonTermsQuery fix for ignored coordination factor #11780 (issue: #11730)
-
Fix support for _name in some queries #11694
-
Better exception if array passed to term query. #11384 (issue: #11246)
-
Score value is 0 in _explanation with random_score query [ISSUE] #10742
-
Avoid NPE during query parsing #10333
-
Function score: Apply min_score to sub query score if no function provided #10326 (issue: #10253)
-
function_score: undo "Remove explanation of query score from functions" #9826
-
Fix wrong error messages in MultiMatchQueryParser #8597
-
DateMath: Fix semantics of rounding with inclusive/exclusive ranges. #8556 (issue: #8424)
-
Make simple_query_string leniency more fine-grained #8162 (issue: #7967)
-
Fix NumberFormatException in Simple Query String Query #7876 (issue: #7875)
-
Function Score: Fix explain distance string #7248
-
Function Score: Remove explanation of query score from functions #7245
-
Cache range filter on date field by default #7122 (issue: #7114)
-
Throw exception if function in function score query is null #6784 (issue: #6292)
-
QueryParser can return null from a query #6723 (issue: #6722)
-
Fix MatchQueryParser not parsing fuzzy_transpositions #6300
-
Range/Term query/filter on dates fail to handle numbers properly #5969
-
Fixing questionable PRNG behavior #5613 (issues: #5454, #5578)
-
Add slop to prefix phrase query after parsing query string #5438 (issues: #5005, #5437)
-
Allow edit distances > 2 on FuzzyLikeThisQuery #5374 (issue: #5292)
-
Use FieldMapper to create the low level term queries in CommonTermQuery #5273 (issue: #5258)
-
Make exists/missing behave consistently with exists/missing. #5145 (issue: #5142)
-
Allow specifying nested fields in simple_query_string #5110 (issue: #5091)
-
Added exception to match and multi-match queries if passed an invalid type param #4971 (issue: #4964)
-
Filtered query parses _name incorrectly [ISSUE] #4960
-
Never cache a range filter that uses the now date expression. #4828 (issue: #4846)
-
- Query Refactoring
-
-
Query DSL: don’t cache type filter in DocumentMapper #12447
-
- REST
-
-
Return 408 REQUEST_TIMEOUT if _cluster/health times out #12780
-
fielddata_fields query string parameter was ignored. #11368 (issue: #11025)
-
Update RestRequest.java #11305
-
Render non-elasticsearch exception as root cause #10850 (issue: #10836)
-
Add fielddata_fields to the REST spec #9399 (issues: #4492, #9398)
-
Get field mapping api should honour pretty flag #8806 (issue: #6552)
-
Passing fielddata_fields as a non-array causes OOM #8203
-
Reroute API response didn’t filter metadata #7523 (issue: #7520)
-
Allows all options for expand_wildcards parameter #7290 (issue: #7258)
-
Added support for empty field arrays in mappings #7271 (issue: #6133)
-
Empty HTTP body returned from _recovery API on empty cluster [ISSUE] #5743
-
Search template: Put source param into template variable #5598 (issue: #5556)
-
Fix possible exception in toCamelCase method #5207
-
Source filtering with wildcards broken when given multiple patterns #5133 (issue: #5132)
-
Ignore case when parsing script_values_sorted|unique in aggregations. #5010 (issue: #5009)
-
scroll REST API should support source parameter #4942 (issue: #4941)
-
Fix potential NPE when no source and no body #4932 (issues: #4892, #4900, #4901, #4902, #4903, #4924)
-
mtermvectors REST API should support source parameter #4910 (issue: #4902)
-
percolate REST API should support source parameter #4909 (issue: #4903)
-
mpercolate REST API should support source parameter #4908 (issue: #4900)
-
msearch REST API should support source parameter #4905 (issue: #4901)
-
mget REST API should support source parameter #4893 (issue: #4892)
-
Cluster state toXContent serialization only returns needed data #4889 (issue: #4885)
-
- Recovery
-
-
Endless recovery loop with indices.recovery.file_chunk_size=0Bytes #12919
-
Rethrow exception during recovery finalization even if source is not broken #12667
-
Check for incompatible mappings while upgrading old indices #12406 (issue: #11857)
-
Fix MapperException detection during translog ops replay #11583 (issue: #11363)
-
Fix recovered translog ops stat counting when retrying a batch #11536 (issue: #11363)
-
Restart recovery upon mapping changes during translog replay #11363 (issue: #11281)
-
Add engine failure on recovery finalization corruption back #11241
-
Decrement reference even if IndexShard#postRecovery barfs #11201
-
Fail recovery retry if resetRecovery fails #11149
-
Refactor state format to use incremental state IDs #10316
-
RecoveryState.File.toXContent reports file length as recovered bytes #10310
-
Fail shard when index service/mappings fails to instantiate #10283
-
Gateway: improve assertion at the end of shard recovery #10028
-
Synchronize RecoveryState.timer methods #9943
-
Don’t recover from buggy version #9925 (issues: #7210, #9922)
-
Fix deadlock problems when API flush and finish recovery happens concurrently #9648
-
Handle corruptions during recovery finalization #9619
-
Mapping update task back references already closed index shard #9607
-
Update access time of ongoing recoveries #9506 (issue: #8720)
-
Cleaner interrupt handling during cancellation #9000
-
Harden recovery for old segments #8399
-
Prefer recovering the state file that uses the latest format. #8343
-
Change check for finished to a ref count check #8271 (issue: #8092)
-
RecoveriesCollection.findRecoveryByShard should call recoveryStatus.tryIncRef before accessing fields #8231 (issue: #8092)
-
Mapping check during phase2 should be done in cluster state update task #7744
-
Don’t update indexShard if it has been removed before #7509
-
Increment Store refcount on RecoveryTarget #6844
-
Recovery from local gateway should re-introduce new mappings #6659
-
Honor time delay when retrying recoveries #6226
-
Do not start a recovery process if the primary shard is currently allocated on a node which is not part of the cluster state #6024
-
- Scripting
-
-
Consistently name Groovy scripts with the same content #12296 (issue: #12212)
-
Execute Scripting Engine before searching for inner templates in template query #11512
-
Allow script language to be null when parsing #10976 (issue: #10926)
-
File scripts cache key to include language and prevent conflicts #10033
-
Avoid unnecessary utf8 conversion when creating ScriptDocValues for a string field. #9557 (issue: #6908)
-
Disallow method pointer expressions in Groovy scripting #9509
-
Make _score in groovy scripts comparable #9094 (issue: #8828)
-
Function score and optional weight: avg score is wrong #9004 (issue: #8992)
-
Return new lists on calls to getValues. #8591 (issue: #8576)
-
Add score() back to AbstractSearchScript #8417 (issues: #8377, #8416)
-
Clear the GroovyClassLoader cache before compiling #8062 (issues: #7658, #8073)
-
Fix NPE in ScriptService when script file with no extension is deleted #7953 (issue: #7689)
-
Scripting: Wrap groovy script exceptions in a serializable Exception object #6628 (issue: #6598)
-
- Scroll
- Search
-
-
Never cache match_all queries. #13032
-
_all: Stop NPE querying _all when it doesn’t exist #12495 (issue: #12439)
-
Free all pending search contexts if index is closed or removed #12180 (issue: #12116)
-
Release search contexts after failed dfs or query phase for dfs queries #11434 (issue: #11400)
-
Don’t truncate TopDocs after rescoring #11342 (issues: #11277, #7707)
-
Matched queries: Remove redundant and broken code #10694 (issue: #10661)
-
Make sure that named filters/ queries defined in a wrapped query/filters aren’t lost #9166 (issue: #6871)
-
Fix paging on strings sorted in ascending order. #9157 (issue: #9136)
-
Terms filter lookup caching should cache values, not filters. #9027 (issues: #1, #2)
-
Refactor term analysis for simple_query_string prefix queries #8435
-
Use ConcurrentHashMap in SCAN search to keep track of the reader states. #7499 (issue: #7478)
-
Make ignore_unmapped work for sorting cross-index queries. #7039 (issue: #2255)
-
Query DSL: Improved explanation for match_phrase_prefix #6767 (issue: #2449)
-
The query_string cache should returned cloned Query instances. #6733 (issue: #2542)
-
Match query with operator and, cutoff_frequency and stacked tokens #6573
-
XFilteredQuery default strategy prefers query first in the deleted docs … #6254 (issue: #6247)
-
limit filter returns wrong results if deleted documents are present [ISSUE] #6234
-
Use default forceAnalyzeQueryString if no query builder is present #6217 (issue: #6215)
-
Read full message on free context #6148 (issues: #5730, #6147)
-
Search might not return on thread pool rejection #6032 (issue: #4887)
-
Scroll api reduce phase fails if shard failures occur #6022
-
Fix setting of readerGen in BytesRefOrdValComparator on nested documents. #5986
-
Replace InternalSearchResponse#EMPTY with InternalSearchResponse#empty() #5775
-
The clear scroll apis should optionally accepts a scroll_id in the request body. #5734 (issue: #5726)
-
Make sure successful operations are correct if second search phase is fast #5713
-
Do not propagate errors from onResult to onFailure. #5629
-
Fix IndexShardRoutingTable’s shard randomization to not throw out-of-bounds exceptions. #5561 (issue: #5559)
-
Convert TermQuery to PrefixQuery if PHRASE_PREFIX is set #5553 (issue: #5551)
-
Use patched version of TermsFilter to prevent using wrong cached results #5393 (issue: #5363)
-
Fix SearchContext occasionally closed prematurely #5170 (issue: #5165)
-
Exposed shard id related to a failure in delete by query #5125 (issue: #5095)
-
Fix AndDocIdSet#IteratorBasedIterator to not violate initial doc state #5070 (issue: #5049)
-
- Search Templates
- Settings
-
-
Do not swallow exceptions thrown while parsing settings #13039 (issue: #13028)
-
Add explicit check that we have reached the end of the settings stream when parsing settings #12451 (issue: #12382)
-
Medium Interval time for ResourceWatcher should be 30 seconds #12423
-
Copy the classloader from the original settings when checking for prompts #12419 (issue: #12340)
-
Replace references to ImmutableSettings with Settings #11843
-
Always normalize root paths during resolution of paths #11446 (issue: #11426)
-
Prevent changing the number of replicas on a closed index #11410 (issue: #9566)
-
Read configuration file with .yaml suffix #10909 (issue: #9706)
-
Validate number_of_shards/_replicas without index setting prefix #10701 (issue: #10693)
-
Fix handling of IndicesOptions in update settings REST API #10030
-
Reset TieredMP settings only if the value actually changed #9497 (issue: #8890)
-
cluster.routing.allocation.disk.threshold_enabled accepts wrong values [ISSUE] #9309
-
Ensure fields are overridden and not merged when using arrays #8381 (issue: #6887)
-
Tab characters in YAML should throw an exception #8355 (issue: #8259)
-
Dynamic changes to max_merge_count are now picked up by index throttling #8136 (issue: #8132)
-
Validate create index requests' number of primary/replica shards #7496 (issue: #7495)
-
LogConfigurator resolveConfig also reads .rpmnew or .bak files #7457
-
Fix bug in PropertyPlaceholder and add unit tests #6034
-
- Shadow Replicas
- Snapshot/Restore
-
-
Improve repository verification failure message #11925 (issue: #11922)
-
Improve logging of repository verification exceptions. #11763 (issue: #11760)
-
Blob store shouldn’t try deleting the write.lock file at the end of the restore process #11517
-
Move in-progress snapshot and restore information from custom metadata to custom cluster state part #11486 (issue: #8102)
-
Sync up snapshot shard status on a master restart #11450 (issue: #11314)
-
Fix cluster state task name for update snapshot task #11197
-
Don’t reuse source index UUID on restore #10367
-
Automatically add "index." prefix to settings that are changed on restore if the prefix is missing #10269 (issue: #10133)
-
Delete operation should ignore finalizing shards on nodes that no longer exist #9981 (issue: #9924)
-
Allow deletion of snapshots with corrupted snapshot files #9569 (issue: #9534)
-
Better handling of index deletion during snapshot #9418 (issue: #9024)
-
Add validation of restored persistent settings #9051 (issue: #8830)
-
Improve snapshot creation and deletion performance on repositories with large number of snapshots #8969 (issue: #8958)
-
Switch to write once mode for snapshot metadata files #8782 (issue: #8696)
-
Restore with wait_for_completion:true should wait for successfully restored shards to get started #8545 (issue: #8340)
-
Keep the last legacy checksums file at the end of restore #8358 (issue: #8119)
-
Restore of indices that are only partially available in the cluster #8341 (issue: #8224)
-
Fix snapshotting of a single closed index #8047 (issue: #8046)
-
Make it possible to delete snapshots with missing metadata file #7981 (issue: #7980)
-
Make sure indices cannot be renamed into restored aliases #7918 (issue: #7915)
-
Allow to get metadata from arbitrary commit points #7376
-
Improve recovery / snapshot restoring file identity handling #7351
-
Fix NPE in SnapshotsService on node shutdown #7322 (issue: #6506)
-
Fail restore if snapshot is corrupted #6938
-
Add ability to snapshot replicating primary shards #6139 (issue: #5531)
-
Fix for hanging aborted snapshot during node shutdown #5966 (issue: #5958)
-
Fix snapshot status with empty repository #5791 (issue: #5790)
-
Add retry mechanism to get snapshot method #5411
-
Restore of an existing index using rename doesn’t completely open the index after restore [ISSUE] #5212
-
Restore process should replace the mapping and settings if index already exists #5211 (issue: #5210)
-
Handle "true"/"false" in snapshot api for "include_global_state" #4956 (issue: #4949)
-
- Stats
-
-
Use time with nanosecond resolution calculated at the executing node #12346 (issue: #12345)
-
Failure during the fetch phase of scan should invoke the failed fetch… #12087 (issue: #12086)
-
Fix wrong reused file bytes in Recovery API reports #11965 (issue: #11876)
-
Translog: stats fail to serialize size #10105
-
Translog: make sure stats’s op count and size are in sync #10041
-
Fix open file descriptors count on Windows #9397 (issue: #1563)
-
Relax restrictions on filesystem size reporting in DiskUsage #9283 (issues: #9249, #9260)
-
Fix wrong search stats groups in indices API #8950 (issue: #7644)
-
Stats: _status with #shards >> queue capacity failing with BroadcastShardOperationFailedException [ISSUE] #7916
-
Update action returns before updating stats for NONE operations #7639
-
NPE in ShardStats when routing entry is not set yet on IndexShard #7358 (issue: #7356)
-
Recovery API should also report ongoing relocation recoveries #6585
-
Indices stats options #6390
-
Disabled query size estimation in percolator #5372 (issue: #5339)
-
Use num of actual threads if busiestThreads is larger #4928 (issue: #4927)
-
- Store
-
-
Ensure we mark store as corrupted if we fail to read the segments info #11230 (issue: #11226)
-
Fix NPE when checking for active shards before deletion #11110 (issue: #10172)
-
Shard not deleted after relocation if relocated shard is still in post recovery #10172 (issue: #10018)
-
Only ack index store deletion on data nodes #9672 (issue: #9605)
-
Only fail recovery if files are inconsistent #8779
-
Use Lucene checksums if segment version is >= 4.9.0 #8599 (issue: #8587)
-
Calculate Adler32 checksums for legacy files in Store#checkIntegrity #8407
-
Add BWC layer to .si / segments_N hashing to identify segments accurately #7436 (issues: #7351, #7434)
-
Ignore segments.gen on metadata snapshots #7379
-
DistributorDirectory shouldn’t search for directory when reading existing file #7323 (issue: #7306)
-
Delete unallocated shards under a cluster state task #6902
-
Searcher might not be closed if store handle can’t be obtained #5884
-
- Suggesters
-
-
Prevent DirectCandidateGenerator from reusing an unclosed analyzer #12670
-
Ensure empty string completion inputs are not indexed #11158 (issue: #10987)
-
Ensure collate option in PhraseSuggester only collates on local shard #11156 (issue: #9377)
-
Make GeoContext mapping idempotent #10602 (issues: #10581, #8937)
-
Return an HTTP error code when a suggest request failed instead of 200 #10104
-
Fix CompletionFieldMapper to correctly parse weight #8197 (issue: #8090)
-
Infinite loop in GeolocationContextMapping [ISSUE] #7433
-
Bugs with encoding multiple levels of geo precision #7369 (issue: #7368)
-
Completion mapping type throws a misleading error on null value #6926 (issue: #6399)
-
Tie-break suggestions by term #5978
-
Fix Lucene’s getFiniteStrings to not consume Java stack #5927
-
Geo context suggester: Require precision in mapping #5647 (issue: #5621)
-
ContextSuggester: Adding couple of tests to catch more bugs #5596 (issue: #5525)
-
Category type should be called "category" instead of "field" in context suggester #5469
-
Two bugfixes for the completion format #4973
-
marvel.agent Background thread had an uncaught exception: java.lang.NullPointerException [ISSUE] #4970
-
NullPointerException (NPE) in completion suggester requests [ISSUE] #4788
-
- Term Vectors
- Top Hits
-
-
Protected against size and offset larger than total number of documents in a shard #12518 (issue: #12510)
-
Inconsistent sorting of top_hits fixed [ISSUE] #7697
-
Properly support top_hits aggregation in nested and reverse_nested aggregations. #7164 (issue: #3022)
-
Make _source parsing in top_hits aggregation consistent with the search api #6997
-
Track scores should be applied properly for top_hits aggregation. #6934
-
- Translog
-
-
Ignore EngineClosedException during translog fsync #12384
-
Don’t convert possibly corrupted bytes to UTF-8 #11911
-
Mark translog as upgraded in the engine even if a legacy generation exists #11860 (issue: #11858)
-
Translog leaks filehandles if it’s corrupted or truncated #8372
-
Better support for partial buffer reads/writes in translog infrastructure #6576 (issue: #6441)
-
Lower the translog flush triggers to workaround #6363 [ISSUE] #6377
-
- Tribe Node
- Upgrade API
Regressions
- CRUD
-
-
Indexing a document fails when setting version=0&version_type=external [ISSUE] #5662
-
- Core
-
-
Switch back to ConcurrentMergeScheduler as the default [ISSUE] #5817
-
- Discovery
- Internal
-
-
Restore streamInput() performance over PagedBytesReference. #5589
-
- Mapping
- Network
-
-
Only resolve host if explicitly allowed. #12986
-
Upgrades
- Core
-
-
Upgrade to Lucene 5.2.1. #11662
-
Upgrade to Lucene 5.2 #11534
-
Upgrade Jackson to 2.5.3 #11307
-
Upgrade to lucene-5.2.0-snapshot-1681024 #11296
-
Upgrade to lucene-5.2.0-snapshot-1680200. #11218
-
Upgrade to lucene-5.2.0-snapshot-1678978. #11125
-
Upgrade to HPPC 0.7.1 #11035
-
Upgrade to lucene-5.2-snapshot-1675363. #10727 (issue: #10728)
-
Upgrade to Lucene 5.2 r1675100 #10699
-
Upgrade to Lucene-5.2-snapshot-1674183. #10641
-
Upgrade to Lucene 5.2 r1673726 #10612
-
Upgrade to lucene-5.2.0-snapshot-1673124. #10562
-
Update forbiddenapis to version 1.8 #10555
-
Upgrade to lucene-5.1.0-snapshot-1671894. #10468
-
Update to Lucene 5.1 snapshot r1671277 #10435
-
Upgrade to Jackson 2.5.1 #10210
-
Upgrade to Jackson 2.5.1 #10134
-
Upgrade to Lucene r1660560 #9746
-
Upgrade to lucene r1654549 snapshot #9402
-
Upgrade to lucene-5.1.0-snapshot-1652032. #9318
-
Upgrade to lucene 5 r1650327 #9206
-
Upgrade to current Lucene 5.0.0 snapshot #8588
-
Upgrade master to lucene 5.0 snapshot #8347
-
Upgrade to Lucene 4.10 #7584
-
Version bump HPPC to 0.6.0 #7139
-
Upgrade Jackson to 2.4.1 #6757
-
Upgrade to Lucene 4.9 #6623
-
Upgrade to Guava 17 #5953
-
Upgrade to Lucene 4.8.0 #5932
-
Update forbidden-apis to 1.5.1 and remove the relaxed failOnMissingClasses setting, fix typo #5863
-
Upgrade to Lucene 4.7.2 #5802
-
Update JNA to 4.1.0, properly warn on error, hint at noexec mount #5636 (issue: #5493)
-
Upgrade to Lucene 4.7.1 #5635
-
- Dates
-
-
Update joda-time to v2.7 #9610
-
- Geo
- Network
- Scripting
Appendix A: Deleted pages
The following pages have moved or been deleted.
1.1. Nodes shutdown
The _shutdown API has been removed. Instead, set up Elasticsearch to run as
a service (see Running as a Service on Linux or Running as a Service on Windows) or use the -p
command line option to write the PID to a file.
1.2. Bulk UDP API
The Bulk UDP service has been removed. Use the standard Bulk API instead.
1.3. Delete Mapping
It is no longer possible to delete the mapping for a type. Instead you should delete the index and recreate it with the new mappings.
1.4. Index Status
The index _status API has been replaced with the Indices Stats and
Indices Recovery APIs.
1.5. _analyzer
The _analyzer field in type mappings is no longer supported and will be
automatically removed from mappings when upgrading to 2.x.
1.6. _boost
The _boost field in type mappings is no longer supported and will be
automatically removed from mappings when upgrading to 2.x.
1.7. Config mappings
It is no longer possible to specify mappings in files in the config
directory. Instead, mappings should be created using the API with:
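For example, a mapping that previously lived in a config file can now be supplied at index creation time. The index, type, and field names below are illustrative only; the string field type shown is the one used in the 2.x era this appendix covers:

```
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": { "type": "string" }
      }
    }
  }
}
```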
1.10. Queries
Queries and filters have been merged. Any query clause can now be used as a query in “query context” and as a filter in “filter context” (see Query DSL).
1.11. Filters
Queries and filters have been merged. Any query clause can now be used as a query in “query context” and as a filter in “filter context” (see Query DSL).
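As a sketch of the merged model, a single bool query can combine scoring clauses (query context) with non-scoring clauses (filter context); the field names here are hypothetical:

```
GET /_search
{
  "query": {
    "bool": {
      "must":   { "match": { "title": "search" } },
      "filter": { "term":  { "status": "published" } }
    }
  }
}
```

Clauses under must contribute to the relevance score, while clauses under filter only include or exclude documents and can be cached.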
1.15. Bool Filter
The bool filter has been replaced by the Bool Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.16. Exists Filter
The exists filter has been replaced by the Exists Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.17. Missing Filter
The missing filter has been replaced by the Missing Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.18. Geo Bounding Box Filter
The geo_bounding_box filter has been replaced by the Geo Bounding Box Query.
It behaves as a query in “query context” and as a filter in “filter
context” (see Query DSL).
1.19. Geo Distance Filter
The geo_distance filter has been replaced by the Geo Distance Query.
It behaves as a query in “query context” and as a filter in “filter
context” (see Query DSL).
1.20. Geo Distance Range Filter
The geo_distance_range filter has been replaced by the Geo Distance Range Query.
It behaves as a query in “query context” and as a filter in “filter
context” (see Query DSL).
1.21. Geo Polygon Filter
The geo_polygon filter has been replaced by the Geo Polygon Query.
It behaves as a query in “query context” and as a filter in “filter
context” (see Query DSL).
1.22. Geo Shape Filter
The geo_shape filter has been replaced by the GeoShape Query.
It behaves as a query in “query context” and as a filter in “filter
context” (see Query DSL).
1.23. Geohash Cell Filter
The geohash_cell filter has been replaced by the Geohash Cell Query.
It behaves as a query in “query context” and as a filter in “filter
context” (see Query DSL).
1.24. Has Child Filter
The has_child filter has been replaced by the Has Child Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.25. Has Parent Filter
The has_parent filter has been replaced by the Has Parent Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.26. Top Children Query
The top_children query has been removed. Use the Has Child Query instead.
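A has_child query that matches parent documents by their children might look as follows (the index, child type, and field names are hypothetical):

```
GET /my_index/_search
{
  "query": {
    "has_child": {
      "type": "blog_comment",
      "query": { "match": { "body": "elasticsearch" } }
    }
  }
}
```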
1.28. Indices Filter
The indices filter has been replaced by the Indices Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.29. Limit Filter
The limit filter has been replaced by the Limit Query.
It behaves as a query in “query context” and as a filter in “filter
context” (see Query DSL).
1.30. Match All Filter
The match_all filter has been replaced by the Match All Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.31. Nested Filter
The nested filter has been replaced by the Nested Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.32. Prefix Filter
The prefix filter has been replaced by the Prefix Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.33. Query Filter
The query filter has been removed as queries and filters have been merged (see
Query DSL).
1.34. Range Filter
The range filter has been replaced by the Range Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.35. Regexp Filter
The regexp filter has been replaced by the Regexp Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.36. Script Filter
The script filter has been replaced by the Script Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.37. Term Filter
The term filter has been replaced by the Term Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.38. Terms Filter
The terms filter has been replaced by the Terms Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.39. Type Filter
The type filter has been replaced by the Type Query. It behaves
as a query in “query context” and as a filter in “filter context” (see
Query DSL).
1.40. Fuzzy Like This Query
The fuzzy_like_this or flt query has been removed. Instead use
the fuzziness parameter with the
match query or the More Like This Query.
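For instance, a match query with the fuzziness parameter covers the common fuzzy_like_this use case (the field name and query text are illustrative):

```
GET /_search
{
  "query": {
    "match": {
      "title": {
        "query": "serach engin",
        "fuzziness": "AUTO"
      }
    }
  }
}
```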
1.41. Fuzzy Like This Field Query
The fuzzy_like_this_field or flt_field query has been removed. Instead use
the fuzziness parameter with the
match query or the More Like This Query.
1.42. More Like This API
The More Like This API has been removed. Instead, use the More Like This Query.
1.43. Facets
Faceted search refers to a way of exploring large amounts of data by displaying summaries about various partitions of the data and then allowing the user to narrow navigation to a specific partition.
In Elasticsearch, facets were also the name of a feature that computed these
summaries. Facets were replaced by
aggregations in Elasticsearch 1.0, which are a superset
of facets.
1.44. Filter Facet
Facets have been removed. Use the
filter aggregation or
filters aggregation instead.
1.45. Query Facet
Facets have been removed. Use the
filter aggregation or
filters aggregation instead.
1.46. Geo Distance Facet
Facets have been removed. Use the
geo_distance aggregation instead.
1.47. Histogram Facet
Facets have been removed. Use the
histogram aggregation instead.
1.48. Date Histogram Facet
Facets have been removed. Use the
date_histogram aggregation instead.
1.49. Range Facet
Facets have been removed. Use the
range aggregation instead.
1.50. Terms Facet
Facets have been removed. Use the
terms aggregation instead.
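As a sketch, a terms aggregation produces the same kind of summary a terms facet did (the genre field is hypothetical; size: 0 suppresses search hits so only the aggregation is returned):

```
GET /_search
{
  "size": 0,
  "aggs": {
    "genres": {
      "terms": { "field": "genre" }
    }
  }
}
```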
1.51. Terms Stats Facet
Facets have been removed. Use the
terms aggregation
with the stats aggregation
or the extended_stats aggregation
instead.
1.52. Statistical Facet
Facets have been removed. Use the
stats aggregation
or the extended_stats aggregation instead.
1.53. Migrating from facets to aggregations
Facets have been removed. Use Aggregations instead.
1.54. Shard request cache
The shard query cache has been renamed Shard request cache.
1.55. Query cache
The filter cache has been renamed Node Query Cache.
1.56. Nested type
The docs for the nested field datatype have moved to Nested datatype.